Multi-Rollout On-Policy Distillation via Peer Successes and Failures

About

Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.
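
The two peer-context constructions named above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Rollout` type, the `build_peer_context` helper, the bracketed section labels, the 0.5 reward threshold, and the `max_peers` cap are all hypothetical choices made for illustration, since the abstract does not specify the prompt format or the peer-selection rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    text: str      # student-generated trajectory
    reward: float  # sparse verifier reward (assumed: 1.0 = success, 0.0 = failure)


def build_peer_context(
    group: List[Rollout],
    target_idx: int,
    mode: str = "contrastive",
    max_peers: int = 2,
) -> str:
    """Assemble a teacher-side context from the peers of one target rollout.

    mode="positive":    include only successful peers (positive peer imitation).
    mode="contrastive": include successful and failed peers
                        (contrastive success-failure conditioning).
    All formatting below is an illustrative assumption.
    """
    if mode not in ("positive", "contrastive"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Peers are the other rollouts sampled for the same prompt.
    peers = [r for i, r in enumerate(group) if i != target_idx]
    successes = [r for r in peers if r.reward > 0.5][:max_peers]
    failures = [r for r in peers if r.reward <= 0.5][:max_peers]

    parts = [
        f"[Successful attempt {k}]\n{r.text}"
        for k, r in enumerate(successes, start=1)
    ]
    if mode == "contrastive":
        parts += [
            f"[Failed attempt {k}]\n{r.text}"
            for k, r in enumerate(failures, start=1)
        ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    group = [Rollout("x = 4", 1.0), Rollout("x = 5", 0.0), Rollout("x = 3", 0.0)]
    # Context for the third rollout: one success and one failure from its peers.
    print(build_peer_context(group, target_idx=2, mode="contrastive"))
```

In a setup like this, the returned string would be prepended to the teacher's input before scoring the target rollout token by token, so the teacher signal depends on what the rest of the group got right or wrong. How the teacher scores are then turned into a distillation loss is outside this sketch.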

Weichen Yu, Xiaomin Li, Yizhou Zhao, Xiaoze Liu, Ruowang Zhang, Haixin Wang, Yinyi Luo, Chen Henry Wu, Gaurav Mittal, Matt Fredrikson, Yu Hu • 2026

Related benchmarks

Task	Dataset	Result
Tool Use Reasoning	ToolAlpaca	Accuracy 66.73	20
Mathematical Reasoning	AIME 2025	Avg@8 (e_m) 25.41	20
Tool Use	ToolUse	--	18
Scientific Question Answering	SciKnowEval	Accuracy (Biology) 55.69	15
Code Generation	LiveCodeBench v6	Mean@8 61.82	8
Code Generation	LiveCodeBench	Mean@8 61.92	7
Math Reasoning	AIME 2024	Mean@8 Score 28.54	4
Math Reasoning	HMMT Feb 25	Mean@8 18.5	4
Math Reasoning	HMMT25 Nov.	Mean@8 15.83	4
Question Answering	Science QA	Accuracy (Biology) 55.69	2

