Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

About

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training DeepSeek-R1. However, GRPO fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation underscores a key gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these signals. Our first contribution is to introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be either directly trained or adapted from existing LLMs. We prove that this diversification can accelerate GRPO's learning dynamics in a simplified setting. We also empirically validate the proposed stepwise guided policy optimization (SGPO) method, demonstrating consistent gains across model sizes (7B, 14B, 32B) in offline and online training on 9 benchmarks, including base and distilled variants. Our results highlight two advantages: (i) SGPO surpasses GRPO, especially in the early and mid-training stages where all-negative-sample groups are prevalent; and (ii) SGPO does not require judge models to generate correct answers, differentiating it from knowledge distillation methods.

Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin • 2025
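
The all-negative-sample issue described in the abstract can be illustrated with a small sketch. It assumes the commonly described GRPO advantage, in which each response's reward is standardized against its group's mean and standard deviation. The function stepwise_reward and the judge verdicts below are hypothetical and serve only as illustration: the abstract says a step-wise judge adds diversity within a group but does not spell out how judge output becomes a reward, so the "fraction of steps before the first error" rule is an assumption, not the authors' method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_reward(first_wrong_step, num_steps):
    """Hypothetical partial credit: share of steps before the first error."""
    return first_wrong_step / num_steps


# A group of four responses that all reach a wrong final answer.
# With a binary outcome reward, every response scores 0.
binary_rewards = [0.0, 0.0, 0.0, 0.0]
binary_adv = group_relative_advantages(binary_rewards)

# Hypothetical judge verdicts: (index of first wrong step, total steps).
judge_verdicts = [(0, 4), (1, 4), (3, 4), (2, 4)]
stepwise_rewards = [stepwise_reward(k, n) for k, n in judge_verdicts]
stepwise_adv = group_relative_advantages(stepwise_rewards)

print("binary advantages:  ", binary_adv)
print("stepwise advantages:", stepwise_adv)
```

Under the binary reward every advantage is zero, so the group contributes no policy gradient. Under the assumed step-wise scoring, the response that stays correct longest receives the largest advantage even though no response is fully correct.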

Related benchmarks

Task                   | Dataset       | Result                | Rank
Mathematical Reasoning | Minerva       | Pass@1 (Avg@16) 41.3  | 32
Mathematical Reasoning | AMC23         | Avg@16 77.6           | 29
Mathematical Reasoning | HMMT Feb 2025 | --                    | 23
Mathematical Reasoning | AIME 2024     | Average@16 31.2       | 15
Mathematical Reasoning | AIME 2025     | Average@16 22.3       | 15
Mathematical Reasoning | MATH 500      | Average@16 87.8       | 15
Mathematical Reasoning | OlympiadBench | Average@16 59.3       | 15
Mathematical Reasoning | Hmmt feb-2024 | Average@16 13.7       | 15