
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

About

In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we show that the standard SFT gradient implicitly encodes a problematic reward structure that can severely restrict the model's generalization relative to RL. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes the gradient update for each token by dynamically rescaling the objective with that token's probability. With just a single-line code change, DFT outperforms standard SFT across multiple challenging benchmarks and base models, from mathematical reasoning to code generation and multi-modal tasks, demonstrating improved generalization. DFT also achieves competitive results in offline RL settings, offering an effective yet streamlined alternative. By bridging theoretical insight with a practical solution, this work advances the state of SFT. The source code will be available at https://github.com/yongliang-wu/DFT.
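The abstract describes DFT as rescaling each token's loss term by that token's (stop-gradient) probability. A minimal sketch of that idea, assuming the per-token objective is simply the negative log-likelihood weighted by the token probability (the exact objective and implementation details are in the paper's repository, not reproduced here):

```python
import math

def sft_loss(token_probs):
    # Standard SFT: mean negative log-likelihood over the target tokens.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def dft_loss(token_probs):
    # DFT sketch (assumption from the abstract): rescale each token's
    # NLL by the token's probability, treated as a constant (in an
    # autograd framework this weight would be detached), damping the
    # large gradients that low-probability tokens produce under SFT.
    return -sum(p * math.log(p) for p in token_probs) / len(token_probs)

probs = [0.9, 0.5, 0.05]
print(round(sft_loss(probs), 4))  # low-probability token dominates
print(round(dft_loss(probs), 4))  # its contribution is down-weighted
```

On this toy example, the rare token (p = 0.05) contributes most of the SFT loss, while DFT's probability weighting shrinks its influence, which is the "single-line change" the abstract refers to: multiplying the per-token loss by the token's probability.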

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang• 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Instruction Following | IFEval | - | - | 625
Mathematical Reasoning | GSM8K | Accuracy | 88.98 | 358
Mathematical Reasoning | CollegeMATH | Accuracy | 48.5 | 276
Mathematical Reasoning | MATH 500 | pass@1 | 85.4 | 239
Mathematical Reasoning | MATH 500 | Accuracy | 81.97 | 119
Scientific Question Answering | GPQA Diamond | Accuracy | 43.81 | 64
Mathematical Reasoning | OlympiadBench | Pass Rate | 45.8 | 36
Multi-Turn Medical Dialogue | MedQA | Accuracy | 48.8 | 32
Multi-Turn Medical Dialogue | MedicalExam | Accuracy | 51.86 | 32
Multi-Turn Medical Dialogue | MedMCQA | Accuracy | 42.2 | 32

Showing 10 of 27 rows.
