Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
About
Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training DeepSeek-R1. However, GRPO fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation underscores a key gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these signals. Our first contribution is to introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be either directly trained or adapted from existing LLMs. We prove that this diversification can accelerate GRPO's learning dynamics in a simplified setting. We also empirically validate the proposed stepwise guided policy optimization (SGPO) method, demonstrating consistent gains across model sizes (7B, 14B, 32B) in offline and online training on 9 benchmarks, including base and distilled variants. Our results highlight two advantages: (i) SGPO surpasses GRPO, especially in the early and mid-training stages where all-negative-sample groups are prevalent; and (ii) SGPO does not require judge models to generate correct answers, differentiating it from knowledge distillation methods.
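To make the all-negative-sample issue concrete, the following is a minimal sketch, not the paper's implementation. It assumes the common GRPO formulation in which each response's advantage is its reward standardized within the group, so a group of identical rewards yields all-zero advantages and no policy update. The `stepwise_reward` function is a hypothetical shaping rule introduced only for illustration: it scores an incorrect response by the fraction of steps that precede the first error flagged by a judge. The abstract does not specify SGPO's actual reward, so this rule should be read as an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (assumed GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_reward(first_error_index, num_steps):
    """Hypothetical shaping rule (an assumption, not taken from the paper):
    fraction of reasoning steps before the judge's first flagged error.
    With a 0-based index the value stays strictly below 1.0."""
    return first_error_index / num_steps


# Outcome-only rewards: every response in the group is wrong.
binary_rewards = [0.0, 0.0, 0.0, 0.0]
binary_adv = group_relative_advantages(binary_rewards)

# Same group, but a step-wise judge locates the first wrong step in each
# response: (0-based index of first wrong step, total number of steps).
judged = [(0, 4), (1, 4), (2, 4), (3, 4)]
shaped_rewards = [stepwise_reward(k, n) for k, n in judged]
shaped_adv = group_relative_advantages(shaped_rewards)

print("binary advantages:", binary_adv)
print("shaped advantages:", shaped_adv)
```

Under these assumptions, the outcome-only group gives every response the same advantage of zero, while the judged group separates responses that fail early from those that fail late. Note that the judge only has to locate errors; it never has to produce a correct answer itself.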
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | Minerva | Pass@1 (Avg@16) | 41.3 | 32 |
| Mathematical Reasoning | AMC23 | Avg@16 | 77.6 | 29 |
| Mathematical Reasoning | HMMT Feb 2025 | -- | -- | 23 |
| Mathematical Reasoning | AIME 2024 | Average@16 | 31.2 | 15 |
| Mathematical Reasoning | AIME 2025 | Average@16 | 22.3 | 15 |
| Mathematical Reasoning | MATH 500 | Average@16 | 87.8 | 15 |
| Mathematical Reasoning | OlympiadBench | Average@16 | 59.3 | 15 |
| Mathematical Reasoning | HMMT Feb 2024 | Average@16 | 13.7 | 15 |