
Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

About

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
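The core idea of DC-SFT, as described above, is to filter the SFT training set so that overly easy and overly hard samples are dropped before fine-tuning. A minimal sketch of that filtering step is below; the difficulty proxy (failure rate over sampled model attempts) and the keep-band thresholds are illustrative assumptions, not the authors' exact recipe — see the repository for the actual implementation.

```python
# Hedged sketch of difficulty-curated data filtering (the DC-SFT idea).
# Difficulty proxy and thresholds are assumptions for illustration.

def estimate_difficulty(pass_count: int, num_attempts: int) -> float:
    """Difficulty as failure rate over sampled model responses:
    0.0 = always solved (trivial), 1.0 = never solved (hard)."""
    return 1.0 - pass_count / num_attempts

def difficulty_curated_filter(samples, lo=0.2, hi=0.8):
    """Keep only medium-difficulty samples; drop the too-easy and
    too-hard ends. `samples` is a list of dicts carrying per-sample
    'pass_count' and 'num_attempts'. The [lo, hi] band is an
    illustrative choice, not a value from the paper."""
    kept = []
    for s in samples:
        d = estimate_difficulty(s["pass_count"], s["num_attempts"])
        if lo <= d <= hi:
            kept.append(s)
    return kept

# Toy usage: 8 sampled attempts per training question.
pool = [
    {"id": "easy",   "pass_count": 8, "num_attempts": 8},  # d = 0.0 -> drop
    {"id": "medium", "pass_count": 4, "num_attempts": 8},  # d = 0.5 -> keep
    {"id": "hard",   "pass_count": 0, "num_attempts": 8},  # d = 1.0 -> drop
]
curated = difficulty_curated_filter(pool)
print([s["id"] for s in curated])  # -> ['medium']
```

The abstract's claim that training on hard samples degrades OOD performance is what motivates discarding the high-difficulty end here, mirroring the implicit filtering it attributes to RL.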

Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Reasoning | MMMU (val) | Accuracy | 52.56 | 114 |
| Image Classification | ImageNet-A | Accuracy | 44.57 | 50 |
| Multimodal Reasoning | WeMath | Accuracy | 62.3 | 43 |
| Multimodal Mathematical Reasoning | MathVista mini (test) | Overall Accuracy | 74.07 | 33 |
| Multi-modal Reasoning | MathVision (test) | Accuracy (%) | 32.96 | 32 |
| Multimodal Mathematical Reasoning | MathVerse mini (test) | – | – | 26 |
| Image Classification | ImageNet-R | Accuracy | 61.74 | 8 |
| Multi-modal Reasoning | MMK12 (test) | Accuracy | 50.7 | 8 |
| Visual Grounding | Ref-L4 | Accuracy | 74.05 | 8 |
| Visual Grounding | Lisa | Accuracy | 70.51 | 8 |
