SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

About

Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.

Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao • 2025
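
The abstract describes a single-stage objective that applies SFT on demonstrations and RL on self-exploration rollouts at the same time, with entropy-aware weighting, but this page does not give the weighting formula. The following is only a minimal sketch under stated assumptions: the weight `exp(-H)` on the SFT term, the plain policy-gradient RL term, and all function names are hypothetical illustrations, not the authors' implementation. It shows the general idea that the SFT term is scaled down when the policy's entropy on the demonstration tokens is high.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_token_entropy(logits):
    # Average Shannon entropy (in nats) of the per-token distributions.
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def sft_nll(logits, target_ids):
    # Supervised term: mean negative log-likelihood of demonstration tokens.
    p = softmax(logits)
    picked = p[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked + 1e-12).mean())

def policy_gradient_loss(logits, sampled_ids, advantage):
    # RL term: REINFORCE-style surrogate on the policy's own rollout tokens.
    p = softmax(logits)
    logp = np.log(p[np.arange(len(sampled_ids)), sampled_ids] + 1e-12)
    return float(-(advantage * logp).mean())

def single_stage_loss(demo_logits, demo_ids, roll_logits, roll_ids, advantage):
    # ASSUMPTION: the SFT weight is exp(-entropy), treated as a constant
    # (no gradient would flow through it in an autodiff framework).
    w_sft = np.exp(-mean_token_entropy(demo_logits))
    sft = sft_nll(demo_logits, demo_ids)
    rl = policy_gradient_loss(roll_logits, roll_ids, advantage)
    return w_sft * sft + rl

# Toy usage: 3 tokens, vocabulary of 4, uniform policy on the demonstration.
demo_logits = np.zeros((3, 4))
demo_ids = np.array([0, 1, 2])
roll_logits = np.array([[2.0, 0.0, 0.0, 0.0]] * 3)
roll_ids = np.array([0, 0, 1])
loss = single_stage_loss(demo_logits, demo_ids, roll_logits, roll_ids, advantage=1.0)
```

In this sketch both terms are summed into one scalar and optimized together, which is what distinguishes a single-stage setup from running SFT first and RL afterwards.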

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	--	895
Mathematical Reasoning	MATH 500	Accuracy (Acc) 40.1	543
Mathematical Reasoning	AMC	Accuracy (%) 14.3	368
Mathematical Reasoning	Minerva Math	Accuracy 15.3	233
Mathematical Reasoning	Olympiad Bench	Accuracy 24.4	222
Mathematical Reasoning	MATH 500	Accuracy 45.4	221
Mathematical Reasoning	AIME 2024 (test)	--	209
Mathematical Reasoning	MATH 500	Accuracy 72.2	116
Scientific Reasoning	ARC Challenge	--	115
Mathematical Reasoning	Minerva Math	Accuracy 32.4	104

Showing 10 of 42 rows
