Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

About

On-policy distillation (OPD) is increasingly used in LLM post-training because it can leverage a teacher model to provide dense supervision on student rollouts. The standard implementation, however, usually reduces distribution matching to a sampled-token log-ratio, which can make the learning signal fragile on long rollouts whose prefixes drift away from the teacher's typical support. We revisit this formulation from both theoretical and implementation perspectives. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL minimization, but admits a substantially tighter worst-case variance bound; a controlled synthetic study further shows that stronger future-reward coupling increases gradient variance and destabilizes training. Empirically, we identify three failure modes of sampled-token OPD: imbalanced token-level supervision, unreliable teacher guidance on student-generated prefixes, and tokenizer or special-token mismatch. These findings motivate teacher top-K local support matching, a truncated reverse-KL objective that compares teacher and student distributions over a teacher-supported token set at each prefix, together with top-p rollout sampling and special-token masking. Across single-task reasoning and multi-task benchmarks spanning agentic and reasoning settings, this objective improves optimization stability and yields a +19.8% performance gain over standard sampled-token OPD baselines, providing a practical recipe for more stable on-policy distillation.
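The contrast between the sampled-token log-ratio and teacher top-K local support matching can be sketched for a single prefix as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, logits are plain Python lists, and both distributions are renormalized over the teacher's top-K set before the reverse KL is taken (the abstract does not say whether the truncated objective renormalizes). Special-token masking is modeled here by simply excluding the given token ids from the support.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def teacher_topk_support(teacher_logits, k, masked_ids=()):
    """Indices of the teacher's k highest-logit tokens, skipping masked ids.

    masked_ids stands in for special tokens (assumption: they are
    dropped from the support rather than down-weighted).
    """
    masked = set(masked_ids)
    candidates = [i for i in range(len(teacher_logits)) if i not in masked]
    candidates.sort(key=lambda i: teacher_logits[i], reverse=True)
    return candidates[:k]


def truncated_reverse_kl(student_logits, teacher_logits, k, masked_ids=()):
    """Reverse KL(student || teacher) restricted to the teacher top-k set.

    Assumption: both distributions are renormalized over the support,
    which is what taking a softmax over only the selected logits does.
    """
    support = teacher_topk_support(teacher_logits, k, masked_ids)
    s_logp = log_softmax([student_logits[i] for i in support])
    t_logp = log_softmax([teacher_logits[i] for i in support])
    return sum(math.exp(s) * (s - t) for s, t in zip(s_logp, t_logp))


def sampled_token_log_ratio(student_logits, teacher_logits, token_id):
    """Single-sample signal used by standard sampled-token OPD."""
    s = log_softmax(student_logits)[token_id]
    t = log_softmax(teacher_logits)[token_id]
    return s - t
```

The sampled-token signal depends on one token per position, whereas the truncated objective compares the two distributions over every token in the teacher-supported set at that prefix. In a training loop the per-prefix value would be averaged over the positions of each student rollout.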

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, Dongbin Zhao • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	BeyondAIME	Pass@16 69	39
Mathematical Reasoning	HMMT25	Avg@16 Accuracy 30.6	36
Mathematical Reasoning	AIME25	Pass@16 86.6	36
Code Generation	MBPP+	Avg@16 Accuracy 72.2	33
Code Generation	HumanEval+	Avg@16 Score 87.2	24
Mathematical Reasoning	AMOBench	Pass@16 38.5	24
Mathematical Reasoning	AMO-Bench	Average@16 9.8	24
Agentic Reasoning	ALFWorld (test)	Success Rate 97.7	21
Math	AIME25	Avg@16 68.5	16
Math	AIME24	Avg@16 75.6	16

Showing 10 of 23 rows
