Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

About

Large Vision-Language Models (LVLMs) have become powerful general-purpose assistants, yet their predictions often lack reliability and interpretability due to insufficient grounding in visual evidence. The emerging thinking-with-images paradigm seeks to address this issue by explicitly anchoring reasoning to image regions. However, we empirically find that most existing methods suffer from a systematic scale-driven bias in optimization, where training rewards are dominated by large visual regions, suppressing learning from small but semantically critical evidence and leading to spurious grounding at inference time. To address this limitation, we propose Ground-R1, a de-biased thinking-with-images framework trained via a novel Scale Relative Policy Optimization (SRPO) objective that replaces standard GRPO. Specifically, our SRPO recalibrates reward learning across evidence regions of different sizes through scale-aware binning and intra-/inter-bin comparisons, enabling balanced credit assignment during training. Experimental results on general LVLM, high-resolution, and visual grounding benchmarks validate the effectiveness of Ground-R1 and show that SRPO yields consistent gains over standard GRPO in both response accuracy and evidence grounding.
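To make the scale-driven bias and the binning idea concrete, here is a minimal sketch. It is not the paper's SRPO formulation, which the abstract does not spell out. It contrasts the standard GRPO group-normalized advantage with a hypothetical scale-relative variant. The bin edges, the `inter_weight` parameter, the function names, and the way the intra-bin and inter-bin terms are combined are all assumptions made for illustration.

```python
from statistics import mean, pstdev

EPS = 1e-6  # avoids division by zero when all rewards in a group are equal


def grpo_advantages(rewards):
    """Standard GRPO: normalize each reward against the whole rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def scale_bin(area_ratio, edges=(0.05, 0.25)):
    """Hypothetical binning by region area / image area.

    Returns 0 (small), 1 (medium), or 2 (large). The edges are assumed values.
    """
    return sum(area_ratio >= e for e in edges)


def scale_relative_advantages(rewards, area_ratios, inter_weight=0.5):
    """Illustrative scale-aware advantage (an assumption, not the paper's SRPO).

    intra-bin term: reward normalized against rollouts in the same scale bin.
    inter-bin term: bin mean minus the unweighted mean of bin means, so each
                    bin counts once regardless of how many rollouts it holds.
    """
    # Group rollout indices by the scale bin of their evidence region.
    bins = {}
    for i, ratio in enumerate(area_ratios):
        bins.setdefault(scale_bin(ratio), []).append(i)

    bin_means = {b: mean([rewards[i] for i in idx]) for b, idx in bins.items()}
    overall = mean(list(bin_means.values()))

    advantages = [0.0] * len(rewards)
    for b, idx in bins.items():
        sigma = pstdev([rewards[i] for i in idx])
        for i in idx:
            intra = (rewards[i] - bin_means[b]) / (sigma + EPS)
            inter = bin_means[b] - overall
            advantages[i] = intra + inter_weight * inter
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 1.0]        # toy rewards for four rollouts
    area_ratios = [0.01, 0.02, 0.5, 0.6]  # two small regions, two large
    print(grpo_advantages(rewards))
    print(scale_relative_advantages(rewards, area_ratios))
```

In this toy group, the two small-region rollouts are compared against each other rather than against the large-region rollouts, so the successful small-region rollout keeps a clearly positive advantage instead of being averaged against a group dominated by large regions.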

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, Xiaodan Liang • 2025

Related benchmarks

Task	Dataset	Result
Visual Grounding	RefCOCO+ (val)	--	264
Visual Grounding	RefCOCO+ (testA)	--	256
Visual Grounding	RefCOCO+ (testB)	--	230
Visual Grounding	RefCOCO (val)	--	177
Visual Grounding	RefCOCO (testA)	--	167
Visual Grounding	RefCOCO (testB)	--	164
Visual Grounding	RefCOCOg (val)	--	163
Visual Grounding	RefCOCOg (test)	--	160
High-Resolution Visual Reasoning	HR-Bench-8K	Accuracy 71.1	28
High-Resolution Visual Reasoning	HR-Bench-4K	Accuracy 75	15
