
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

About

Standard LLM distillation treats all training problems equally -- wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1{-}p)$ where $p$ is the student's empirical pass rate -- concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove the Beta kernel $w(p) = p^\alpha(1{-}p)^\beta$ is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and is minimax-robust under misspecification (worst-case efficiency loss $O(\delta^2)$). Across Qwen3, Qwen2.5, and Llama-3 families, PACED sets a new state of the art in our experimental setting on MATH-500, AIME 2024, and AIME 2025, improving over unweighted distillation by up to $\mathbf{+8.2}$ and over the strong AKL baseline by up to $\mathbf{+3.6}$, while reducing forgetting to $\mathbf{1.4\%}$ and $\mathbf{0.6\%}$ in distillation and self-distillation, respectively. A two-stage forward-then-reverse KL schedule pushes gains further to $\mathbf{+5.8}$ over standard forward KL on the hardest benchmark.
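The core weighting scheme from the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function names and the rollout count are assumptions.

```python
# Sketch of the PACED per-problem weight described in the abstract:
# w(p) = p^alpha * (1 - p)^beta, with the paper's default alpha = beta = 1
# giving w(p) = p(1 - p). Names here are illustrative, not from the paper.

def pass_rate(successes: int, rollouts: int) -> float:
    """Empirical pass rate p estimated from student rollouts."""
    return successes / rollouts

def paced_weight(p: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    With alpha = beta = 1 this peaks at p = 0.5 and vanishes at
    p = 0 and p = 1, so mastered and unsolvable problems get no
    training signal while borderline problems get the most."""
    return (p ** alpha) * ((1.0 - p) ** beta)

# Example: weights for problems solved 0, 2, 4, and 8 times out of 8 rollouts.
weights = [paced_weight(pass_rate(s, 8)) for s in (0, 2, 4, 8)]
```

Note how this needs only the student's own rollouts: no teacher queries, no extra hyperparameters beyond the (fixed) Beta-kernel exponents.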

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang• 2026

Related benchmarks

Task                    Dataset    Metric            Result  Rank
Language Understanding  MMLU       MMLU Accuracy     73      77
Mathematical Reasoning  AIME 2024  Mean Score (k=8)  31.6    59
Mathematical Reasoning  AIME 25    Mean@8 Accuracy   35.6    21
Mathematical Reasoning  AIME 2025  Avg@8 Score       25.1    14
Mathematical Reasoning  AIME 24    Accuracy@8        41.6    14
