ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

About

On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.

Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	AIME 2024	Accuracy@16	34.03	86
Scientific Reasoning	SciKnowEval Chemistry	Mean@16	45.48	27
Scientific Reasoning	SciKnowEval Biology	Mean@16	37.08	24
Scientific Reasoning	SciKnowEval Material	Mean@16	62.21	24
Tool Use Reasoning	Tool use	Mean Accuracy@16	59.38	24
Scientific Reasoning	SciKnowEval Physics	Mean@16	56.98	24
Scientific and tool-use reasoning	SciKnowEval and ToolUse In-domain	Material Score	80.18	8
