Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

About

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
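The abstract names the three signals but does not give their formulas, so the following is only a minimal sketch of how such a token weight could be assembled. Everything in it is an assumption made for illustration, not the authors' definition: the function names, the normalization of entropy by log(vocabulary size), the `floor` and `window` parameters, the sign-based reward direction, and the choice of a minimum over a short lookahead window to separate transient from sustained high-entropy positions.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (nats) of each row of a [T, V] probability matrix."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_gate(entropy, vocab_size, floor=0.1):
    """Map entropy to a weight in [floor, 1]; higher entropy gives a lower weight.

    Assumption: entropy is normalized by its maximum, log(vocab_size).
    The floor keeps every token weight strictly positive.
    """
    normalized = entropy / np.log(vocab_size)
    return floor + (1.0 - floor) * (1.0 - normalized)

def lookahead_entropy(entropy, window=2):
    """Hypothetical lookahead: minimum entropy over a position and the next
    `window` positions. A spike that is quickly followed by low entropy is
    treated as transient; a sustained high-entropy span stays high.
    """
    T = len(entropy)
    return np.array([entropy[t:t + window + 1].min() for t in range(T)])

def token_weights(teacher_probs, student_probs, tokens, reward,
                  floor=0.1, window=None):
    """Combine direction, magnitude, and gate into per-token weights.

    Assumptions: direction is +1 for a rewarded rollout and -1 otherwise;
    magnitude is the absolute teacher-student log-likelihood ratio on the
    sampled tokens.
    """
    T, V = teacher_probs.shape
    idx = np.arange(T)
    log_ratio = (np.log(teacher_probs[idx, tokens])
                 - np.log(student_probs[idx, tokens]))
    direction = 1.0 if reward > 0 else -1.0
    entropy = token_entropy(teacher_probs)
    if window is not None:
        entropy = lookahead_entropy(entropy, window)
    gate = entropy_gate(entropy, V, floor)
    return direction * np.abs(log_ratio) * gate, gate

if __name__ == "__main__":
    # Toy rollout: confident, uncertain (uniform), confident teacher positions.
    teacher = np.array([[0.97, 0.01, 0.01, 0.01],
                        [0.25, 0.25, 0.25, 0.25],
                        [0.01, 0.97, 0.01, 0.01]])
    student = np.full((3, 4), 0.25)
    tokens = np.array([0, 2, 1])
    for w in (None, 1):
        weights, gate = token_weights(teacher, student, tokens, reward=1.0, window=w)
        print("window:", w, "gate:", gate, "weights:", weights)
```

In this toy setup the uniform teacher position receives the floor weight under the plain gate, while the lookahead variant treats it as a transient spike because the next position is low entropy.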

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, Linfeng Zhang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | MATH 500 | Top-1 Accuracy: 87.2 | 384
Mathematical Reasoning | Minerva | Pass@1 Accuracy: 35.48 | 289
Mathematical Reasoning | HMMT 2025 | -- | 194
Mathematical Reasoning | AIME 24 | Pass@1 Accuracy: 77.22 | 128
Mathematical Reasoning | HMMT25 | Accuracy (%): 52.5 | 115
Mathematical Reasoning | AIME 2024 | Avg @K Score: 77.92 | 7
Mathematical Reasoning | AIME 2025 | Average Score (avg@K): 68.75 | 7
Mathematical Reasoning | MATH 500 | Accuracy (avg@K): 87.17 | 7
Mathematical Reasoning | GSM8K | Accuracy (avg@K): 94.26 | 7
Mathematical Reasoning | Minerva Math | Avg@K: 32.9 | 7