RL Fine-Tuning Heals OOD Forgetting in SFT
About
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) is a standard post-training recipe for improving the reasoning of Large Language Models (LLMs), but why it works remains unclear. We revisit the common claim that "SFT memorizes, RL generalizes" through checkpoint-wise analyses of in-distribution (ID) and out-of-distribution (OOD) reasoning. We find that OOD performance often peaks early during SFT and then declines even as ID reasoning continues to improve. RL typically does not surpass this early SFT peak; rather, it restores OOD capability lost during later SFT, and only when started from a bounded range of SFT checkpoints. Further spectral analysis shows that this forgetting-and-recovery pattern correlates with rotations of singular vectors, while singular values remain largely stable. These findings suggest a more precise view of post-training dynamics: SFT can forget, RL can recover, and controlling singular-vector rotation may improve OOD robustness. Code is available at https://github.com/jinhangzhan/RL_Heals_SFT.
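To make the spectral claim concrete, the following is a minimal sketch of how one might separate "singular values changed" from "singular vectors rotated" when comparing a weight matrix at two checkpoints. It is an illustration on a synthetic matrix, not the authors' analysis code: the function name `spectral_drift`, the choice of the top `k` left singular vectors, and the use of principal angles as the rotation measure are all assumptions made here.

```python
import numpy as np


def spectral_drift(w_before, w_after, k=4):
    """Compare two weight matrices through their SVDs (hypothetical helper).

    Returns the relative change of the top-k singular values and the
    principal angles (radians) between the top-k left singular subspaces.
    """
    u0, s0, _ = np.linalg.svd(w_before, full_matrices=False)
    u1, s1, _ = np.linalg.svd(w_after, full_matrices=False)
    # Relative change in the leading singular values.
    sv_change = np.linalg.norm(s1[:k] - s0[:k]) / np.linalg.norm(s0[:k])
    # The singular values of U0^T U1 are the cosines of the principal angles
    # between the two k-dimensional subspaces.
    cosines = np.linalg.svd(u0[:, :k].T @ u1[:, :k], compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return sv_change, angles


# Synthetic "checkpoint" with a known, well-separated spectrum (assumption).
rng = np.random.default_rng(0)
n = 16
u, _ = np.linalg.qr(rng.standard_normal((n, n)))
v, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.linspace(10.0, 1.0, n)
w0 = u @ np.diag(sigma) @ v.T

# Second "checkpoint": rotate the first left singular vector toward the last
# one by theta radians while leaving every singular value untouched.
theta = 0.3
givens = np.eye(n)
givens[0, 0] = givens[n - 1, n - 1] = np.cos(theta)
givens[0, n - 1] = -np.sin(theta)
givens[n - 1, 0] = np.sin(theta)
w1 = (u @ givens) @ np.diag(sigma) @ v.T

sv_change, angles = spectral_drift(w0, w1, k=4)
print("relative singular-value change:", sv_change)
print("largest principal angle (rad):", angles.max())
```

In this toy setup the singular values are identical by construction, so any difference between the two matrices shows up only in the principal angles. On real checkpoints one would apply the same comparison layer by layer; which layers and which `k` are informative is not specified by the abstract.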
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reasoning | ARC Challenge | Accuracy (ARC): 0.92 | 34 |
| General Knowledge | MMLU-Pro | Accuracy: 73 | 33 |
| Instruction Following | IFEval loose | -- | 18 |