UFT: Unifying Supervised and Reinforcement Fine-Tuning

About

Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking that underlies existing methods. Notably, UFT generally outperforms both SFT and RFT, regardless of model size. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
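
To give some intuition for the "exponential sample complexity bottleneck" mentioned above, here is a toy sketch. It is not the paper's analysis or algorithm. It assumes a hypothetical task that needs `horizon` correct steps in a row, `branching` equally likely choices per step with exactly one correct, and a reward only when every step is right. The function name `expected_rollouts` and the `hint_len` parameter (how many leading steps are supplied as supervision) are illustrative assumptions.

```python
# Toy model (an assumption for illustration, not the paper's setting):
# reward is 1 only if all `horizon` steps are correct, and an untrained
# policy picks each step uniformly at random among `branching` choices.


def expected_rollouts(horizon: int, branching: int, hint_len: int = 0) -> int:
    """Expected number of uniform-random rollouts until the first success
    when the first `hint_len` steps are given to the model as supervision."""
    if not 0 <= hint_len <= horizon:
        raise ValueError("hint_len must be between 0 and horizon")
    free_steps = horizon - hint_len
    # Each rollout succeeds with probability branching ** -free_steps, so the
    # expected wait (mean of a geometric distribution) is the reciprocal.
    return branching ** free_steps


if __name__ == "__main__":
    for hint_len in (0, 5, 9):
        rollouts = expected_rollouts(horizon=10, branching=4, hint_len=hint_len)
        print(f"hint_len={hint_len}: about {rollouts} rollouts to first reward")
```

In this toy model, pure exploration over 10 steps needs about a million rollouts to see a single reward, while supplying part of the solution shrinks the exponent. That is only meant to show why mixing supervision signals into exploration can matter on long-horizon tasks; the paper's actual guarantees and method are stated in the paper itself.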

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar • 2025

Related benchmarks

Task	Dataset	Result
General Reasoning	MMLU-Pro	Accuracy 49.4	213
Mathematical Reasoning	In-Distribution Reasoning Performance Suite (AIME, AMC, MATH-500, Minerva, Olympiad)	AIME 2024 Score 24.8	119
General Reasoning	Out-of-Distribution Performance Suite (ARC-c, GPQA*, MMLU-Pro) (test)	ARC-c Score 82.2	73
Math Reasoning	AIME 2025	Accuracy 16.5	60
Math Reasoning	AIME 2024	Accuracy 20.8	39
Math Reasoning	MATH 500	Accuracy 83.8	38
General Reasoning	ARC-C	Accuracy 83.4	35
Math Reasoning	AMC	Accuracy (%) 58.8	23
Math Reasoning	Olympiad	Accuracy (Olympiad) 51.6	11
General Domain Reasoning	GPQA	Accuracy 34.5	11

Showing 10 of 11 rows
