Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

About

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
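The routing idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`group_relative_advantages`, `entropy_weight`, `route_group`) are hypothetical, rewards are assumed to be binary, and the entropy weight is assumed to be one minus the normalized entropy of the self-teacher's token distribution, since the abstract does not give the exact weighting formula.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def entropy_weight(teacher_probs):
    """Assumed weighting (not from the source): 1 - H(p) / log(V).

    Returns roughly 1.0 for a confident (one-hot) teacher distribution and
    roughly 0.0 for a uniform one. Requires more than one vocabulary entry.
    """
    h = -sum(p * math.log(p) for p in teacher_probs if p > 0)
    return 1.0 - h / math.log(len(teacher_probs))


def route_group(rewards, teacher_token_probs):
    """Split one rollout group into a GRPO branch and an SDPO branch.

    rewards: one 0/1 reward per rollout.
    teacher_token_probs: per rollout, a list of per-token teacher distributions.
    Returns (grpo, sdpo): grpo maps the index of each correct rollout to its
    group-relative advantage; sdpo maps the index of each failed rollout to its
    per-token distillation weights.
    """
    advantages = group_relative_advantages(rewards)
    grpo, sdpo = {}, {}
    for i, reward in enumerate(rewards):
        if reward > 0:
            # Correct sample: reward-aligned reinforcement.
            grpo[i] = advantages[i]
        else:
            # Failed sample: logit-level correction, down-weighted where the
            # teacher distribution is high-entropy.
            sdpo[i] = [entropy_weight(p) for p in teacher_token_probs[i]]
    return grpo, sdpo


# Toy group of four rollouts over a 4-token vocabulary, two tokens each.
confident = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
rewards = [1, 0, 1, 0]
teacher = [[confident, confident], [confident, uniform],
           [confident, confident], [uniform, uniform]]
grpo_branch, sdpo_branch = route_group(rewards, teacher)
```

In this toy group, rollouts 0 and 2 land in the GRPO branch and rollouts 1 and 3 in the SDPO branch, where tokens with a uniform teacher distribution receive a weight near zero. How the two branch losses are combined into one on-policy objective is not specified in the abstract and is left out of the sketch.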

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME 2025	--	45
Mathematical Reasoning	Minerva	Avg@16: 26.1	42
Mathematical Reasoning	HMMT25	Avg@16 Accuracy: 17.7	36
Mathematical Reasoning	Math Reasoning AIME24, AIME25, HMMT25	AIME24 Score: 77.1	30
Mathematical Reasoning	AIME24	Pass@16: 43.8	30
Math Reasoning	AMC23	Mean Score @12: 92.1	28
Logical reasoning	Knight-and-Knaves	Pass@1 Rate (4-8 Roles, ID): 75	28
Math Reasoning	AIME24	Mean@12 Score: 75.3	28
Math Reasoning	AIME 25	Mean@12 Score: 65.6	28
Mathematical Problem Solving	AMC 2023	Accuracy (avg@k): 93.12	27

Showing 10 of 34 rows
