
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

About

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for large language models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
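The abstract describes DFT as a single-line change to the SFT objective: each token's cross-entropy term is rescaled by the probability the model assigns to that token. Below is a minimal PyTorch sketch of that idea; the `dft_loss` helper and its signature are illustrative assumptions rather than the authors' released code, and detaching the probability weight (so it rescales the loss without contributing its own gradient) is our reading of "rescaling the objective function".

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, labels, ignore_index=-100):
    """Sketch of a DFT-style objective (hypothetical helper, not the
    authors' code): per-token cross-entropy reweighted by the detached
    probability the model assigns to each target token.

    logits: (batch, seq_len, vocab) pre-softmax scores
    labels: (batch, seq_len) target token ids
    """
    # Standard per-token negative log-likelihood, i.e. the SFT loss
    # before averaging over tokens.
    nll = F.cross_entropy(
        logits.transpose(1, 2),  # (batch, vocab, seq_len) as expected by F.cross_entropy
        labels,
        ignore_index=ignore_index,
        reduction="none",
    )  # (batch, seq_len)

    # Probability of each target token; detached so it acts as a weight
    # rather than adding its own gradient term.
    probs = torch.exp(-nll).detach()

    mask = (labels != ignore_index).float()
    # DFT: weight each token's loss by its own (stopped-gradient) probability.
    return (probs * nll * mask).sum() / mask.sum().clamp(min=1.0)
```

In a standard training loop this would replace the usual cross-entropy call, e.g. `loss = dft_loss(model(input_ids).logits, labels)`; relative to plain SFT, low-probability tokens are down-weighted, which is the "reward rectification" the title refers to.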

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | GSM8K | Accuracy | 88.98 | 358
Instruction Following | IFEval | Accuracy (0-100) | 49.86 | 292
Mathematical Reasoning | CollegeMATH | Accuracy | 48.5 | 161
Mathematical Reasoning | MATH 500 | Pass@1 | 85.4 | 153
Mathematical Reasoning | MATH 500 | Accuracy | 81.97 | 119
Scientific Question Answering | GPQA Diamond | Accuracy | 43.81 | 64
Mathematical Reasoning | OlympiadBench | Pass Rate | 45.8 | 36
Mathematical Reasoning | AIME25 | Pass@8 | 18.3 | 29
Multi-task performance evaluation | GPQA-Diamond, GSM8K, MATH-500, AIME’24, and IFEval (aggregate) | Avg Score | 56.38 | 25
Mathematical Reasoning | GSM8K | Pass Rate | 96 | 20

Showing 10 of 16 rows.
