Can Large Reasoning Models Self-Train?

About

Recent successes of reinforcement learning (RL) in training large reasoning models raise the question of whether self-training, in which a model learns from its own judgments, can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. Across a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach improves not only the model's reasoning performance but also its ability to generate higher-quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of this self-training paradigm: prolonged RL with self-reward leads to reward hacking, in which models learn to maximize the training (pseudo-)reward, resulting in sudden and complete performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms that enable prolonged self-improvement.

Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, Andrea Zanette • 2025
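
The majority-voting self-feedback mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `majority_vote_rewards`, the binary 0/1 reward, and the assumption that each sampled response has already been reduced to a final-answer string are all hypothetical choices made for illustration.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Assign pseudo-rewards to sampled answers by majority vote.

    `answers` is a list of final-answer strings sampled from the same
    model for one prompt (hypothetical input format). The most frequent
    answer is treated as the pseudo-label; each sample gets reward 1.0
    if it matches the pseudo-label and 0.0 otherwise.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the most frequent answer; ties are broken
    # in favor of the answer that appeared first in `answers`.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    samples = ["42", "41", "42", "7", "42"]
    label, rewards = majority_vote_rewards(samples)
    print(label, rewards)
```

No ground-truth label appears anywhere in this sketch, so if every sample agrees on the same answer, all samples receive reward 1.0 whether or not that answer is correct. That is one way to picture how maximizing such a pseudo-reward could drift away from actual task performance.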

Related benchmarks

Task	Dataset	Result
Instruction Following	IFEval	--	854
Mathematical Reasoning	AIME 2024 (test)	--	294
Math Reasoning	GSM8K	Pass@4 Accuracy: 95	54
Math Reasoning	MATH 500	Pass@4: 71	39
Multi-task Knowledge	MMLU-Pro	MMLU-Pro Score: 0.4383	33
Code Generation	LiveCodeBench	Avg@5 Accuracy: 18.37	27
Math Reasoning	MATH 500	Success Rate (pass@4): 85.4	27
Math Reasoning	AIME 24	Pass@16: 33.33	27
Math Reasoning	AIME 2024	Accuracy (avg@16): 11.25	27
Code	CRUX	Accuracy @5: 52.2	27

Showing 10 of 14 rows
