On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

About

In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of model compared to RL. To rectify this, we propose Dynamic Fine-Tuning (\model), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. With just a single-line change, the method outperforms standard SFT on multiple difficult benchmarks and base models, from math reasoning to code generation and multi-modal tasks, demonstrating improved generalization. Additionally, \model~achieves competitive results in offline RL settings, providing an effective yet streamlined alternative. By bridging theoretical insights with practical solutions, this work advances the state of SFT. The source code will be available at https://github.com/yongliang-wu/DFT.

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang• 2025

Related benchmarks

Task	Dataset	Result
Instruction Following	IFEval	IFEval Accuracy57	836
Science Question Answering	ScienceQA	--	791
Mathematical Reasoning	MATH 500	Accuracy (Acc)63	543
Mathematical Reasoning	AIME 2024	Accuracy5.7	479
Mathematical Reasoning	AMC	Accuracy (%)30.3	368
Mathematical Reasoning	GSM8K	Accuracy (GSM8K)88.98	358
Mathematical Reasoning	CollegeMATH	Accuracy48.5	327
Mathematical Reasoning	Countdown	Accuracy4.6	252
Mathematical Reasoning	MATH 500	pass@185.4	239
Coding	HumanEval	Pass@160.4	168

Showing 10 of 105 rows

...

Other info

Follow for update

@wizwand_team Discord