
UFT: Unifying Supervised and Reinforcement Fine-Tuning

About

Post-training has proven important for enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods fall into two categories: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to explore solutions effectively while incorporating informative supervision signals, bridging the gap between the memorizing and thinking that underlie existing methods. Notably, UFT generally outperforms both SFT and RFT, regardless of model size. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| General Reasoning | MMLU-Pro | Accuracy | 49.4 | 201 |
| Mathematical Reasoning | In-Distribution Reasoning Performance Suite (AIME, AMC, MATH-500, Minerva, Olympiad) | AIME 2024 Score | 24.8 | 112 |
| General Reasoning | Out-of-Distribution Performance Suite (ARC-c, GPQA*, MMLU-Pro) (test) | ARC-c Score | 82.2 | 66 |
| Math Reasoning | AIME 2025 | Accuracy | 16.5 | 60 |
| Math Reasoning | AIME 2024 | Accuracy | 20.8 | 39 |
| Math Reasoning | MATH 500 | Accuracy | 83.8 | 38 |
| General Reasoning | ARC-C | Accuracy | 83.4 | 35 |
| Math Reasoning | AMC | Accuracy (%) | 58.8 | 23 |
| Math Reasoning | Olympiad | Accuracy (Olympiad) | 51.6 | 11 |
| General Domain Reasoning | GPQA | Accuracy | 34.5 | 11 |
Showing 10 of 11 rows
