SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

About

This work revisits the dominant paradigm for training Large Vision-Language Models (LVLMs), supervised fine-tuning (SFT) followed by reinforcement learning (RL), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewriting, and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL, and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on the Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%. We hope our findings provide valuable insights for developing reasoning-capable LVLMs and can inform future research in this area.
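For readers unfamiliar with GRPO, the group-relative step it is named after can be sketched as follows. This is a minimal illustration of the generally published formulation, not the authors' implementation: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions, and the reward values are placeholders rather than outputs of the mixed reward module described above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response relative to its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. Each reward is centered on the group mean and scaled by the
    group standard deviation, so no separate value network is needed.
    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four responses to one prompt (not real data).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive positive advantages and those below receive negative ones. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.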

Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, Cihang Xie • 2025

Related benchmarks

Task | Dataset | Result | Rank
Visual Mathematical Reasoning | MathVista | Accuracy 69.9 | 189
Multiple-choice Question Answering | MMLU-Pro | MMLU-Pro Overall Accuracy 21.56 | 116
Visual Mathematical Reasoning | MathVerse | Accuracy 48.9 | 73
Visual Mathematical Reasoning | MathVision | Accuracy 26.3 | 63
Visual Reasoning | V*Bench | Accuracy 56.54 | 58
Multimodal Reasoning | MMMU-Pro | Accuracy 39.5 | 55
Visual Mathematical Reasoning | WeMath | Accuracy 67.7 | 53
Mathematical Multimodal Reasoning | MathVista | Accuracy 68 | 46
Multimodal Reasoning | M3CoT (test) | Total Acc 61.3 | 31
Multimodal Math Reasoning | MathVision | Accuracy 26.4 | 31
Showing 10 of 40 rows
