
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

About

Standard LLM distillation treats all training problems equally -- wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1{-}p)$ where $p$ is the student's empirical pass rate -- concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove the Beta kernel $w(p) = p^\alpha(1{-}p)^\beta$ is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and is minimax-robust under misspecification (worst-case efficiency loss $O(\delta^2)$). Across Qwen3, Qwen2.5, and Llama-3 families, PACED sets a new state of the art in our experimental setting on MATH-500, AIME 2024, and AIME 2025, improving over unweighted distillation by up to $\mathbf{+8.2}$ and over the strong AKL baseline by up to $\mathbf{+3.6}$, while reducing forgetting to $\mathbf{1.4\%}$ and $\mathbf{0.6\%}$ in distillation and self-distillation, respectively. A two-stage forward-then-reverse KL schedule pushes gains further to $\mathbf{+5.8}$ over standard forward KL on the hardest benchmark.
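The core weighting scheme from the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function names and the rollout count are assumptions.

```python
# Sketch of the PACED per-problem weight described in the abstract:
# w(p) = p^alpha * (1 - p)^beta, with the paper's default alpha = beta = 1
# giving w(p) = p(1 - p). Names here are illustrative, not from the paper.

def pass_rate(successes: int, rollouts: int) -> float:
    """Empirical pass rate p estimated from student rollouts."""
    return successes / rollouts

def paced_weight(p: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    With alpha = beta = 1 this peaks at p = 0.5 and vanishes at
    p = 0 and p = 1, so mastered and unsolvable problems get no
    training signal while borderline problems get the most."""
    return (p ** alpha) * ((1.0 - p) ** beta)

# Example: weights for problems solved 0, 2, 4, and 8 times out of 8 rollouts.
weights = [paced_weight(pass_rate(s, 8)) for s in (0, 2, 4, 8)]
```

Note how this needs only the student's own rollouts: no teacher queries, no extra hyperparameters beyond the (fixed) Beta-kernel exponents.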

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang• 2026

Related benchmarks

Task                    Dataset    Metric            Result  Rank
Language Understanding  MMLU       MMLU Accuracy     73      77
Mathematical Reasoning  AIME 2024  Mean Score (k=8)  31.6    59
Mathematical Reasoning  AIME 25    Mean@8 Accuracy   35.6    21
Mathematical Reasoning  AIME 2025  Avg@8 Score       25.1    14
Mathematical Reasoning  AIME 24    Accuracy@8        41.6    14
