On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
About
In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities compared to RL. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. With just a single-line change, the method outperforms standard SFT on multiple challenging benchmarks and base models, from math reasoning to code generation and multi-modal tasks, demonstrating improved generalization. Additionally, DFT achieves competitive results in offline RL settings, providing an effective yet streamlined alternative. By bridging theoretical insights with practical solutions, this work advances the state of SFT. The source code will be available at https://github.com/yongliang-wu/DFT.
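The abstract describes DFT as a single-line change that rescales each token's SFT loss by that token's predicted probability. A minimal PyTorch-style sketch of that idea is shown below; the function name, tensor shapes, and the use of a detached probability as the rescaling weight are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, labels):
    """Sketch of a DFT-style objective: per-token cross-entropy rescaled
    by the (detached) probability of the target token.

    logits: (batch, seq_len, vocab_size), labels: (batch, seq_len)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each target token.
    token_logp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    ce = -token_logp                     # standard SFT per-token loss
    # Hypothetical single-line change: weight each token's loss by its
    # probability, detached so the weight carries no gradient.
    weight = token_logp.exp().detach()
    return (weight * ce).mean()
```

Because each weight lies in (0, 1), low-probability tokens contribute a damped gradient, which is one way to read the "stabilizing gradient updates" claim in the abstract.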
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | IFEval | -- | -- | 625 |
| Mathematical Reasoning | GSM8K | Accuracy | 88.98 | 358 |
| Mathematical Reasoning | CollegeMATH | Accuracy | 48.5 | 276 |
| Mathematical Reasoning | MATH 500 | pass@1 | 85.4 | 239 |
| Mathematical Reasoning | MATH 500 | Accuracy | 81.97 | 119 |
| Scientific Question Answering | GPQA Diamond | Accuracy | 43.81 | 64 |
| Mathematical Reasoning | OlympiadBench | Pass Rate | 45.8 | 36 |
| Multi-Turn Medical Dialogue | MedQA | Accuracy | 48.8 | 32 |
| Multi-Turn Medical Dialogue | MedicalExam | Accuracy | 51.86 | 32 |
| Multi-Turn Medical Dialogue | MedMCQA | Accuracy | 42.2 | 32 |