ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

About

Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. While a sequential SFT $\rightarrow$ RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia• 2025

Related benchmarks

Task	Dataset	Result
Reasoning Segmentation	ReasonSeg (val)	gIoU66.5	382
Reasoning Segmentation	ReasonSeg (test)	gIoU65	287
Generalized Referring Expression Segmentation	gRefCOCO (val)	--	165

Showing 3 of 3 rows

Other info

GitHub

Follow for update

@wizwand_team Discord