Multi-Rollout On-Policy Distillation via Peer Successes and Failures
About
Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.
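To make the two peer-context constructions concrete, the sketch below shows one way a teacher context could be assembled from a rollout group. It is an illustration under stated assumptions, not the paper's implementation: the `Rollout` type, the `build_peer_context` function, the binary reward threshold, the `max_peers` cap, and the bracketed section labels are all hypothetical, since the abstract does not specify the prompt format or how peers are selected.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One student attempt for a prompt (hypothetical container)."""
    text: str
    reward: float  # verifier reward; assumed binary here (1.0 success, 0.0 failure)


def build_peer_context(
    group: List[Rollout],
    target_idx: int,
    mode: str = "contrastive",
    max_peers: int = 2,
) -> str:
    """Assemble peer evidence for scoring the rollout at `target_idx`.

    mode="positive": include successful peers only (positive peer imitation).
    mode="contrastive": include successful and failed peers
    (contrastive success-failure conditioning).
    """
    if mode not in ("positive", "contrastive"):
        raise ValueError(f"unknown mode: {mode!r}")

    # The rollout being distilled is never shown to the teacher as its own peer.
    peers = [r for i, r in enumerate(group) if i != target_idx]

    # Assumption: a reward above 0.5 counts as a success.
    successes = [r for r in peers if r.reward > 0.5][:max_peers]
    failures = [r for r in peers if r.reward <= 0.5][:max_peers]

    parts = [f"[Successful peer attempt]\n{r.text}" for r in successes]
    if mode == "contrastive":
        parts += [f"[Failed peer attempt]\n{r.text}" for r in failures]
    return "\n\n".join(parts)


if __name__ == "__main__":
    group = [Rollout("x=1", 0.0), Rollout("x=2", 1.0), Rollout("x=3", 0.0)]
    print(build_peer_context(group, target_idx=0, mode="contrastive"))
```

In this sketch the returned string would be prepended to the teacher's input before it scores the target rollout's tokens; how the teacher's token-level signal is then turned into a distillation loss is outside what the abstract describes and is not modeled here.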
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tool Use Reasoning | ToolAlpaca | Accuracy | 66.73 | 20 |
| Mathematical Reasoning | AIME 2025 | Avg@8 (e_m) | 25.41 | 20 |
| Tool Use | ToolUse | -- | -- | 18 |
| Code Generation | LiveCodeBench v6 | Mean@8 | 61.82 | 8 |
| Code Generation | LiveCodeBench | Mean@8 | 61.92 | 7 |
| Math Reasoning | AIME 2024 | Mean@8 Score | 28.54 | 4 |
| Math Reasoning | HMMT Feb 25 | Mean@8 | 18.5 | 4 |
| Math Reasoning | HMMT25 Nov. | Mean@8 | 15.83 | 4 |
| Scientific Question Answering | SciKnowEval | Accuracy (Biology) | 55.69 | 4 |
| Question Answering | Science QA | Accuracy (Biology) | 55.69 | 2 |