PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
About
Standard LLM distillation treats all training problems equally -- wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1{-}p)$, where $p$ is the student's empirical pass rate -- concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove that the Beta kernel $w(p) = p^\alpha(1{-}p)^\beta$ is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and that it is minimax-robust under misspecification (worst-case efficiency loss $O(\delta^2)$). Across the Qwen3, Qwen2.5, and Llama-3 families, PACED sets a new state of the art in our experimental setting on MATH-500, AIME 2024, and AIME 2025, improving over unweighted distillation by up to $\mathbf{+8.2}$ points and over the strong AKL baseline by up to $\mathbf{+3.6}$, while reducing forgetting to $\mathbf{1.4\%}$ (distillation) and $\mathbf{0.6\%}$ (self-distillation). A two-stage forward-then-reverse KL schedule pushes gains further, to $\mathbf{+5.8}$ over standard forward KL on the hardest benchmark.
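The weighting scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function name `paced_weight` and the example pass rates are hypothetical, and it assumes pass rates are estimated from a fixed number of student rollouts per problem.

```python
import numpy as np

def paced_weight(p, alpha=1.0, beta=1.0):
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    With alpha = beta = 1 this recovers the default PACED weight
    p * (1 - p), which peaks at p = 0.5 (the zone of proximal
    development) and vanishes at p = 0 and p = 1, matching the
    SNR collapse at both pass-rate extremes.
    """
    p = np.asarray(p, dtype=float)
    return p ** alpha * (1.0 - p) ** beta

# Hypothetical empirical pass rates from k = 8 student rollouts
# per problem: solved never, once, half the time, usually, always.
pass_rates = np.array([0.0, 0.125, 0.5, 0.875, 1.0])

w = paced_weight(pass_rates)
# Normalize so the weights form a distribution over problems;
# fully mastered (p = 1) and fully unsolved (p = 0) problems
# receive zero weight and contribute no gradient.
w = w / w.sum()
```

Under these assumptions, the per-problem distillation loss would simply be scaled by `w` before averaging, which requires no change to the model or optimizer.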
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU | Accuracy | 73 | 77 |
| Mathematical Reasoning | AIME 2024 | Mean Score (k=8) | 31.6 | 59 |
| Mathematical Reasoning | AIME 25 | Mean@8 Accuracy | 35.6 | 21 |
| Mathematical Reasoning | AIME 2025 | Avg@8 Score | 25.1 | 14 |
| Mathematical Reasoning | AIME 24 | Accuracy@8 | 41.6 | 14 |