Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
About
On-policy distillation (OPD) is increasingly used in LLM post-training because it can leverage a teacher model to provide dense supervision on student rollouts. The standard implementation, however, usually reduces distribution matching to a sampled-token log-ratio, which can make the learning signal fragile on long rollouts whose prefixes drift away from the teacher's typical support. We revisit this formulation from both theoretical and implementation perspectives. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL minimization, but admits a substantially tighter worst-case variance bound; a controlled synthetic study further shows that stronger future-reward coupling increases gradient variance and destabilizes training. Empirically, we identify three failure modes of sampled-token OPD: imbalanced token-level supervision, unreliable teacher guidance on student-generated prefixes, and tokenizer or special-token mismatch. These findings motivate teacher top-K local support matching, a truncated reverse-KL objective that compares teacher and student distributions over a teacher-supported token set at each prefix, together with top-p rollout sampling and special-token masking. Across single-task reasoning and multi-task benchmarks spanning agentic and reasoning settings, this objective improves optimization stability and yields a +19.8% performance gain over standard sampled-token OPD baselines, providing a practical recipe for more stable on-policy distillation.
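To make the contrast between the two per-token signals concrete, the following is a minimal sketch and not the authors' implementation. It compares the sampled-token log-ratio with a reverse KL restricted to the teacher's top-K tokens at a single prefix. The function names, the renormalization of both distributions over the truncated support, and the handling of masked special tokens by excluding them from the support are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sampled_token_log_ratio(student_logits, teacher_logits, token_id):
    """Sampled-token signal: log p_student(y) - log p_teacher(y) for one token y."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return s[token_id] - t[token_id]


def topk_reverse_kl(student_logits, teacher_logits, k, masked_ids=()):
    """Reverse KL(student || teacher) over the teacher's top-k tokens.

    Assumptions (hypothetical, for illustration only):
      * special tokens listed in masked_ids are dropped before the top-k selection;
      * both distributions are renormalized over the selected support.
    """
    masked = set(masked_ids)
    candidates = [i for i in range(len(teacher_logits)) if i not in masked]
    # Teacher-supported token set: the k candidates with the highest teacher logits.
    support = sorted(candidates, key=lambda i: teacher_logits[i], reverse=True)[:k]
    s = log_softmax([student_logits[i] for i in support])
    t = log_softmax([teacher_logits[i] for i in support])
    # Reverse KL: expectation under the student of log(student / teacher).
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


# Toy vocabulary of four tokens at one prefix.
teacher = [2.0, 1.0, 0.0, -1.0]
student = [0.0, 1.0, 2.0, -1.0]
kl_top2 = topk_reverse_kl(student, teacher, k=2)
kl_masked = topk_reverse_kl(student, teacher, k=2, masked_ids=[0])
ratio = sampled_token_log_ratio(student, teacher, token_id=2)
```

The sampled-token ratio depends on which single token the student happened to sample, whereas the truncated objective compares the two distributions over every token in the teacher-supported set, which is the property the abstract associates with more stable optimization.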
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Agentic Reasoning | ALFWorld (test) | Success Rate | 97.7 | 21 |
| Math Reasoning | OlympiadBench | Pass@1 | 43.9 | 14 |
| Math Reasoning | AIME 25 | Pass@1 | 26.7 | 12 |
| Mathematical Reasoning | AIME 24 | Pass@8 | 60 | 6 |
| Math Reasoning | Minerva | Pass@1 | 34.9 | 6 |
| Math Reasoning | Math Reasoning benchmarks Math500, AIME24, AIME25, Minerva, OlympiadBench (test) | Math500 Accuracy | 82 | 6 |
| Math Reasoning | AIME 24 | Pass@1 Score | 23.3 | 6 |
| Mathematical Reasoning | AIME 25 | Pass@8 | 53.33 | 6 |