Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

About

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training recent reasoning models, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce a simple framework to mitigate the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO's learning dynamics. We then empirically validate Stepwise Guided Policy Optimization (SGPO) across model sizes (7B, 14B, 32B) in both offline and online training on nine reasoning benchmarks (including base and distilled variants). Overall, SGPO improves average performance and is effective in early and mid-training when all-negative groups are prevalent, while improvements are not uniform across every benchmark and depend on the structure and informativeness of negative samples. Finally, SGPO does not require the judge model to generate correct solutions, distinguishing it from knowledge distillation methods.
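The all-negative-sample issue described above can be illustrated with a minimal sketch. It assumes the common GRPO formulation in which each reward is standardized within its group, and it uses a hypothetical partial-credit rule (the fraction of steps before the first error flagged by a judge) as a stand-in for step-wise feedback. The function names, the `eps` constant, and the partial-credit rule are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_reward(step_is_correct):
    """Hypothetical partial credit: fraction of steps before the first error.

    `step_is_correct` is a list of booleans, assumed to come from a
    step-wise judge that marks each reasoning step as correct or not.
    """
    if not step_is_correct:
        return 0.0
    correct_prefix = 0
    for ok in step_is_correct:
        if not ok:
            break
        correct_prefix += 1
    return correct_prefix / len(step_is_correct)


# A group of four responses whose final answers are all wrong.
binary_rewards = [0.0, 0.0, 0.0, 0.0]
flat = group_relative_advantages(binary_rewards)
# Every advantage is zero, so the policy gradient for this group vanishes.

# Assumed judge verdicts for the same four responses, one flag per step.
judge_verdicts = [
    [True, True, False],
    [False, True, True],
    [True, False, False],
    [True, True, False],
]
diverse_rewards = [stepwise_reward(v) for v in judge_verdicts]
diverse = group_relative_advantages(diverse_rewards)
# The rewards now differ within the group, so the advantages are nonzero.
```

With outcome-only rewards the group carries no learning signal, whereas the diversified rewards rank the incorrect responses against each other. How SGPO actually scores steps and combines those scores with correct responses is specified in the paper, not in this sketch.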

Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	Overall	Accuracy 80.17	81
Mathematical Reasoning	MATH500	Pass@1 95	77
Mathematical Reasoning	HMMT Feb 2025	--	45
Mathematical Reasoning	Minerva	Pass@1 (Avg@16) 41.3	32
Mathematical Reasoning	AMC23	Avg@16 77.6	29
Mathematical Reasoning	Gaokao	--	21
Mathematical Reasoning	Kaoyan	Pass@1 Score 73.37	19
Mathematical Reasoning	GradeMath	Pass@1 64.76	19
Mathematical Reasoning	Olympiads	Pass@1 70.22	19
Mathematical Reasoning	CHMath24	Average Score @16 88.33	19

Showing 10 of 15 rows
