Can Large Reasoning Models Self-Train?

About

Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training, the process in which a model learns from its own judgments, can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. Across a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach improves not only the model's reasoning performance but also its ability to generate better-quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of this self-training paradigm: prolonged RL with self-reward leads to reward hacking, where models learn to maximize the training (pseudo-)reward, resulting in sudden and complete performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms that enable prolonged self-improvement.
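
As a rough illustration of the majority-voting self-feedback described above, the sketch below turns a group of sampled final answers into pseudo-rewards. The function name `majority_vote_rewards`, the binary 0/1 reward, and the tie-breaking rule are assumptions made for illustration; the abstract does not specify the exact reward used.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a list of sampled final answers.

    Assumption: an answer earns reward 1.0 if it matches the most common
    answer in the group and 0.0 otherwise. No ground-truth label is used.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # Ties are broken in favor of the answer that was seen first.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical samples for one prompt.
samples = ["42", "42", "41", "42", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

In this sketch, a group of identical answers earns full reward whether or not the answer is correct, which is one way a pseudo-reward of this kind could be maximized without improving accuracy.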

Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, Andrea Zanette • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| Instruction Following | IFEval | -- | 836 |
| Mathematical Reasoning | AIME 2024 (test) | -- | 209 |
| Math Reasoning | GSM8K | Pass@4 Accuracy: 95 | 54 |
| Math Reasoning | MATH 500 | Pass@4: 71 | 39 |
| Multi-task Knowledge | MMLU-Pro | MMLU-Pro Score: 0.4383 | 33 |
| Code Generation | LiveCodeBench | Avg@5 Accuracy: 18.37 | 27 |
| Math Reasoning | MATH 500 | Success Rate (pass@4): 85.4 | 27 |
| Math Reasoning | AIME 24 | Pass@16: 33.33 | 27 |
| Math Reasoning | AIME 2024 | Accuracy (avg@16): 11.25 | 27 |
| Code | CRUX | Accuracy @5: 52.2 | 27 |
Showing 10 of 14 rows
