Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

About

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
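
As a rough illustration of how the three signals described above might combine into a single per-token weight, here is a minimal Python sketch. The functional forms are assumptions made for illustration only and are not taken from the paper: the gate is taken to be one minus the teacher's normalized entropy, rescaled so it never falls below a hypothetical `floor` parameter, the magnitude is taken to be the absolute teacher-student log-likelihood gap, and the direction is a reward sign of +1 or -1. The names `entropy`, `confidence_gate`, and `token_weight` are likewise hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_gate(teacher_probs, floor=0.1):
    """Map teacher entropy to a weight in [floor, 1].

    Assumed form: low entropy (confident teacher) gives a weight near 1,
    high entropy gives a weight near `floor`, which stays strictly positive.
    """
    vocab = len(teacher_probs)
    if vocab < 2:
        return 1.0
    normalized = entropy(teacher_probs) / math.log(vocab)
    normalized = min(1.0, max(0.0, normalized))  # guard against rounding
    return floor + (1.0 - floor) * (1.0 - normalized)


def token_weight(reward_sign, teacher_logp, student_logp, teacher_probs, floor=0.1):
    """Combine direction, magnitude, and gate for one sampled token."""
    magnitude = abs(teacher_logp - student_logp)  # assumed likelihood-ratio magnitude
    return reward_sign * magnitude * confidence_gate(teacher_probs, floor)


# Toy teacher distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]  # low-entropy position
flat = [0.25, 0.25, 0.25, 0.25]    # maximum-entropy position

gate_peaked = confidence_gate(peaked)
gate_flat = confidence_gate(flat)
w_peaked = token_weight(+1, math.log(0.97), math.log(0.5), peaked)
w_flat = token_weight(-1, math.log(0.25), math.log(0.5), flat)
```

In this toy setup the flat position keeps a small but nonzero weight, while the peaked position passes through most of its signal. The causal-lookahead variant is not sketched here.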

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, Linfeng Zhang • 2026

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MATH 500	Top-1 Accuracy	87.2	452
Mathematical Reasoning	Minerva	Pass@1 Accuracy	35.48	289
Mathematical Reasoning	HMMT 2025	--	--	241
Mathematical Reasoning	AIME 24	Pass@1 Accuracy	77.22	153
Mathematical Reasoning	HMMT25	Accuracy (%)	52.5	115
Mathematical Reasoning	AIME 2024	Avg @K Score	77.92	7
Mathematical Reasoning	AIME 2025	Average Score (avg@K)	68.75	7
Mathematical Reasoning	MATH 500	Accuracy (avg@K)	87.17	7
Mathematical Reasoning	GSM8K	Accuracy (avg@K)	94.26	7
Mathematical Reasoning	Minerva Math	Avg@K	32.9	7
