
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

About

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training recent reasoning models, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce a simple framework to mitigate the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO's learning dynamics. We then empirically validate Stepwise Guided Policy Optimization (SGPO) across model sizes (7B, 14B, 32B) in both offline and online training on nine reasoning benchmarks (including base and distilled variants). Overall, SGPO improves average performance and is effective in early and mid-training when all-negative groups are prevalent, while improvements are not uniform across every benchmark and depend on the structure and informativeness of negative samples. Finally, SGPO does not require the judge model to generate correct solutions, distinguishing it from knowledge distillation methods.
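The all-negative-sample issue described above can be illustrated with a small sketch. It assumes the common GRPO formulation in which each response's advantage is its reward standardized within the group. The `stepwise_reward` function is a hypothetical shaping rule chosen only for illustration (fraction of steps completed before the first error flagged by a judge); it is not taken from the paper, and the judge outputs below are made-up inputs.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (GRPO-style advantage).

    Assumption: population standard deviation plus a small eps;
    actual implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_reward(first_error_step, num_steps):
    """Hypothetical shaping rule: fraction of steps before the first error.

    first_error_step is the 0-based index of the first step a judge
    marks as incorrect; this rule is an illustrative assumption.
    """
    return first_error_step / num_steps


# Case 1: binary outcome rewards, every response in the group is wrong.
binary_rewards = [0.0, 0.0, 0.0, 0.0]
binary_adv = group_relative_advantages(binary_rewards)

# Case 2: the same four wrong responses, scored by a step-wise judge.
# Each pair is (first_error_step, num_steps); values are illustrative.
judge_output = [(0, 4), (1, 4), (3, 4), (2, 4)]
shaped_rewards = [stepwise_reward(k, n) for k, n in judge_output]
shaped_adv = group_relative_advantages(shaped_rewards)

print("binary advantages:", binary_adv)
print("shaped advantages:", shaped_adv)
```

With identical rewards every advantage is zero, so the group contributes no policy gradient. Once the judge separates the incorrect responses, the advantages become nonzero and the response that progressed furthest before its first error receives the largest one.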

Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH500 | Pass@1: 95 | 60 |
| Mathematical Reasoning | Overall | Accuracy: 80.17 | 36 |
| Mathematical Reasoning | Minerva | Pass@1 (Avg@16): 41.3 | 32 |
| Mathematical Reasoning | AMC23 | Avg@16: 77.6 | 29 |
| Mathematical Reasoning | HMMT Feb 2025 | -- | 28 |
| Mathematical Reasoning | Kaoyan | pass@1 Score: 73.37 | 19 |
| Mathematical Reasoning | GradeMath | Pass@1: 64.76 | 19 |
| Mathematical Reasoning | Olympiads | Pass@1: 70.22 | 19 |
| Mathematical Reasoning | Gaokao | Average Score (@16): 87.11 | 19 |
| Mathematical Reasoning | CHMath24 | Average Score @16: 88.33 | 19 |
Showing 10 of 15 rows
