Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
About
The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
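The abstract describes DC-SFT only at a high level, as filtering the training set by sample difficulty. The sketch below illustrates one way such a filter could look. It is not the authors' implementation (see the linked repository for that). It assumes difficulty is approximated by a base model's pass rate over `k` sampled answers. The names `Sample`, `pass_rate`, `curate_by_difficulty`, and the `generate` callback are hypothetical, as are the threshold values.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Sample:
    prompt: str
    answer: str


def pass_rate(sample: Sample, generate: Callable[[str], str], k: int = 8) -> float:
    """Fraction of k sampled answers that exactly match the reference answer.

    Assumption: exact-match pass rate stands in for "difficulty"; the paper
    may define difficulty differently.
    """
    hits = sum(generate(sample.prompt).strip() == sample.answer for _ in range(k))
    return hits / k


def curate_by_difficulty(
    samples: Sequence[Sample],
    generate: Callable[[str], str],
    k: int = 8,
    min_pass_rate: float = 0.25,  # hypothetical cutoff: below this counts as "hard"
    max_pass_rate: float = 1.0,   # hypothetical cutoff: lower it to also drop easy samples
) -> List[Sample]:
    """Keep samples whose estimated pass rate lies within [min_pass_rate, max_pass_rate]."""
    kept = []
    for sample in samples:
        rate = pass_rate(sample, generate, k)
        if min_pass_rate <= rate <= max_pass_rate:
            kept.append(sample)
    return kept


# Toy usage with a deterministic stand-in for a model.
def toy_generate(prompt: str) -> str:
    return "4" if prompt == "What is 2 + 2?" else "unknown"


toy_data = [
    Sample("What is 2 + 2?", "4"),             # always solved, so it is kept
    Sample("Count the edges in the figure.", "12"),  # never solved, so it is dropped as hard
]
curated = curate_by_difficulty(toy_data, toy_generate, k=4)
```

The curated list would then be passed to an ordinary SFT loop. Dropping samples the base model never solves reflects the abstract's finding that training on hard samples degrades OOD performance.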
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Reasoning | MMMU (val) | Accuracy: 52.56 | 114 |
| Image Classification | ImageNet A | Accuracy: 44.57 | 50 |
| Multimodal Reasoning | WeMath | Accuracy: 62.3 | 43 |
| Multimodal Mathematical Reasoning | MathVista mini (test) | Overall Accuracy: 74.07 | 33 |
| Multi-modal Reasoning | MathVision (test) | Accuracy (%): 32.96 | 32 |
| Multimodal Mathematical Reasoning | MathVerse mini (test) | -- | 26 |
| Image Classification | ImageNet-R | Accuracy: 61.74 | 8 |
| Multi-modal Reasoning | MMK12 (test) | Accuracy: 50.7 | 8 |
| Visual Grounding | Ref-L4 | Accuracy: 74.05 | 8 |
| Visual Grounding | Lisa | Accuracy: 70.51 | 8 |