Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

About

Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self, we introduce On-Policy Self-Distillation (OPSD), a learning algorithm where a single LLM acts as both teacher and student with different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving superior token efficiency compared to reinforcement learning methods and better performance than off-policy distillation methods. Code repo: https://github.com/siyan-zhao/OPSD.

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover • 2026
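
The objective described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`token_kl`, `opsd_loss`), the toy logits, and the choice of KL(student || teacher) as the per-token divergence are all assumptions made for illustration, since the abstract does not say which divergence is used. Both sets of logits are assumed to come from the same model scored on the same student-sampled rollout, with the teacher pass additionally conditioned on the privileged reasoning trace.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(p_logits, q_logits):
    """KL(p || q) between two categorical distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def opsd_loss(student_logits, teacher_logits):
    """Mean per-token divergence over one student-sampled rollout.

    student_logits[t]: logits from the model given (question, rollout[:t]).
    teacher_logits[t]: logits from the same model given
                       (question, privileged trace, rollout[:t]).
    The divergence direction, KL(student || teacher), is an assumption.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must score the same rollout")
    if not student_logits:
        raise ValueError("rollout must contain at least one token")
    per_token = [token_kl(s, t) for s, t in zip(student_logits, teacher_logits)]
    return sum(per_token) / len(per_token)

# Toy rollout of two tokens over a three-word vocabulary (hypothetical values).
student = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
teacher = [[2.0, 0.0, 0.0], [1.0, 0.0, -1.0]]

loss = opsd_loss(student, teacher)       # positive: the first token disagrees
baseline = opsd_loss(student, student)   # zero: identical distributions
```

In a real training loop the teacher logits would presumably be treated as a fixed target and the loss backpropagated only through the student pass, but that detail is also an assumption here rather than something the abstract states.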

Related benchmarks

| Task | Dataset | Result | Rank |
| Mathematical Reasoning | MATH500 (test) | -- | 895 |
| Science Question Answering | ScienceQA | Accuracy 81.2 | 791 |
| Question Answering | ARC Challenge | Accuracy (ARC) 25.03 | 598 |
| Mathematical Reasoning | AIME 2024 | Accuracy 0.00e+0 | 479 |
| Code Generation | MBPP (test) | -- | 405 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy 87.11 | 384 |
| Mathematical Reasoning | Minerva | Pass@1 Accuracy 34.94 | 289 |
| Multimodal Math Reasoning | WeMath | Accuracy 72.36 | 211 |
| Mathematical Reasoning | AIME 2024 (test) | -- | 209 |
| Multimodal Reasoning | MMMU | Accuracy 63.82 | 208 |
Showing 10 of 131 rows