Self-Distilled RLVR

About

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan • 2026
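
The split described in the abstract (direction from the verifiable reward, magnitude from the token-level gap between the privileged teacher and the student) can be illustrated with a minimal sketch. This is not the paper's formula: the group-normalized advantage, the use of the absolute log-probability gap, and the mean-1 rescaling are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def group_advantages(rewards):
    """Sequence-level advantages from verifiable rewards.

    Assumption: group normalization (reward minus group mean, divided by
    group standard deviation). The source does not specify this.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-6) for r in rewards]


def token_weights(student_logps, teacher_logps):
    """Per-token update magnitudes from the teacher/student gap.

    Assumption: magnitude is |teacher log-prob - student log-prob|,
    rescaled so the weights average to 1 over the response.
    """
    gaps = [abs(t - s) for s, t in zip(student_logps, teacher_logps)]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0.0:
        # Teacher and student agree everywhere: fall back to uniform weights.
        return [1.0] * len(gaps)
    return [g / mean_gap for g in gaps]


def token_advantages(seq_advantage, student_logps, teacher_logps):
    """Combine both signals: the sign comes only from the reward-based
    advantage, the per-token size comes only from the self-distillation gap."""
    weights = token_weights(student_logps, teacher_logps)
    return [seq_advantage * w for w in weights]


if __name__ == "__main__":
    # Two sampled responses: the first is correct, the second is not.
    advs = group_advantages([1.0, 0.0])
    # Hypothetical per-token log-probs of one sampled response under the
    # student (no privileged info) and the teacher (same model, with it).
    student = [-1.0, -2.0, -0.5]
    teacher = [-0.5, -0.5, -0.5]
    print(token_advantages(advs[0], student, teacher))
    print(token_advantages(advs[1], student, teacher))
```

In this sketch the teacher can never flip the direction of an update, since the weights are non-negative; it only redistributes how much each token moves. That mirrors the abstract's claim that update directions stay tied to environmental feedback.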

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MathVista	Score 73.5	566
Interactive Decision-making	AlfWorld	--	398
Mathematical Reasoning	WeMath	Accuracy 43.62	317
Mathematical Reasoning	AIME 2024 (test)	--	294
Mathematical Reasoning	MathVerse	Accuracy 52.92	266
Web Navigation and Shopping	Webshop	Score 87.4	248
Multimodal Math Reasoning	WeMath	Accuracy 73.28	228
Multimodal Reasoning	MMMU	Accuracy 67.22	220
Multimodal Understanding	MMMU (val)	--	211
Multimodal Reasoning	WeMath	Accuracy 58	199

Showing 10 of 72 rows
