Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
About
On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
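The abstract names the three signals but does not give their formulas, so the following is only a minimal sketch of how such a per-token weight could be assembled. The linear entropy gate, the `floor` parameter, and all function names are assumptions made for illustration, not the authors' definitions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_gate(teacher_probs, floor=0.1):
    """Hypothetical gate: near 1.0 for a confident teacher, `floor` at maximum entropy.

    The linear form and the default floor are illustrative assumptions.
    """
    max_entropy = math.log(len(teacher_probs))
    if max_entropy <= 0.0:
        return 1.0  # a single-outcome distribution carries no uncertainty
    # Clamp so floating-point rounding can never push the gate below `floor`.
    normalized = min(1.0, entropy(teacher_probs) / max_entropy)
    return floor + (1.0 - floor) * (1.0 - normalized)

def token_weight(reward_sign, teacher_logp, student_logp, teacher_probs, floor=0.1):
    """Signed per-token weight: direction * magnitude * gate (assumed combination)."""
    # Magnitude: absolute teacher-student log-likelihood ratio on the sampled token.
    magnitude = abs(teacher_logp - student_logp)
    return reward_sign * magnitude * confidence_gate(teacher_probs, floor)

# A confident teacher position keeps most of its weight; a uniform one is
# reduced to the floor but never to zero.
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
gate_confident = confidence_gate(confident)
gate_uniform = confidence_gate(uniform)
```

Under these assumptions, a high-entropy position still contributes a small update, which matches the abstract's requirement of a nonzero lower bound on every token weight.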
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy | 87.2 | 384 |
| Mathematical Reasoning | Minerva | Pass@1 Accuracy | 35.48 | 289 |
| Mathematical Reasoning | HMMT 2025 | -- | -- | 194 |
| Mathematical Reasoning | AIME 24 | Pass@1 Accuracy | 77.22 | 128 |
| Mathematical Reasoning | HMMT25 | Accuracy (%) | 52.5 | 115 |
| Mathematical Reasoning | AIME 2024 | Avg@K Score | 77.92 | 7 |
| Mathematical Reasoning | AIME 2025 | Average Score (avg@K) | 68.75 | 7 |
| Mathematical Reasoning | MATH 500 | Accuracy (avg@K) | 87.17 | 7 |
| Mathematical Reasoning | GSM8K | Accuracy (avg@K) | 94.26 | 7 |
| Mathematical Reasoning | Minerva Math | Avg@K | 32.9 | 7 |