
Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

About

Large Vision-Language Models (LVLMs) have become powerful general-purpose assistants, yet their predictions often lack reliability and interpretability due to insufficient grounding in visual evidence. The emerging thinking-with-images paradigm seeks to address this issue by explicitly anchoring reasoning to image regions. However, we empirically find that most existing methods suffer from a systematic scale-driven bias in optimization, where training rewards are dominated by large visual regions, suppressing learning from small but semantically critical evidence and leading to spurious grounding at inference time. To address this limitation, we propose Ground-R1, a de-biased thinking-with-images framework trained via a novel Scale Relative Policy Optimization (SRPO) objective that replaces standard GRPO. Specifically, our SRPO recalibrates reward learning across evidence regions of different sizes through scale-aware binning and intra-/inter-bin comparisons, enabling balanced credit assignment during training. Experimental results on general LVLM, high-resolution, and visual grounding benchmarks validate the effectiveness of Ground-R1 and show that SRPO yields consistent gains over standard GRPO in both response accuracy and evidence grounding.
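The contrast between group-wide and scale-binned reward normalization can be illustrated with a minimal sketch. The first function is the usual GRPO-style advantage, where rewards are standardized over the whole group of sampled responses. The second is only an assumption about what scale-aware binning might look like: the abstract does not give SRPO's bin edges, its reward definition, or how intra- and inter-bin comparisons are combined, so `scale_binned_advantages`, `edges`, `inter_weight`, and `area_ratios` (the evidence region area divided by the image area) are all hypothetical names and choices, not the authors' formulation.

```python
import math
from collections import defaultdict


def _normalize(values, eps=1e-6):
    """Standardize a list of floats to zero mean and (roughly) unit std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]


def grpo_advantages(rewards):
    """GRPO-style advantage: standardize rewards over the whole group."""
    return _normalize(rewards)


def scale_binned_advantages(rewards, area_ratios, edges=(0.05, 0.25),
                            inter_weight=0.5):
    """Hypothetical scale-aware variant (an illustration, not the paper's SRPO).

    Each sample is assigned to a bin by the relative area of its evidence
    region. Rewards are standardized within each bin (intra-bin term), and a
    weighted term based on standardized bin-mean rewards is added
    (inter-bin term). Both `edges` and `inter_weight` are assumed values.
    """
    # Bin index = number of edges the area ratio meets or exceeds.
    def bin_of(ratio):
        return sum(ratio >= e for e in edges)

    bins = defaultdict(list)
    for i, ratio in enumerate(area_ratios):
        bins[bin_of(ratio)].append(i)

    advantages = [0.0] * len(rewards)

    # Intra-bin comparison: each sample competes only with similar-scale ones.
    for idxs in bins.values():
        normed = _normalize([rewards[i] for i in idxs])
        for i, a in zip(idxs, normed):
            advantages[i] = a

    # Inter-bin comparison: shift each bin by its standardized mean reward.
    keys = sorted(bins)
    if len(keys) > 1 and inter_weight != 0.0:
        bin_means = [sum(rewards[i] for i in bins[k]) / len(bins[k])
                     for k in keys]
        for k, shift in zip(keys, _normalize(bin_means)):
            for i in bins[k]:
                advantages[i] += inter_weight * shift

    return advantages


# Toy group: two large-region samples with high reward, two small-region
# samples with low reward.
rewards = [1.0, 0.9, 0.2, 0.1]
area_ratios = [0.5, 0.6, 0.01, 0.02]
group_adv = grpo_advantages(rewards)
binned_adv = scale_binned_advantages(rewards, area_ratios, inter_weight=0.0)
```

In this toy group, the better of the two small-region samples receives a negative advantage under group-wide normalization, because the large-region rewards dominate the group mean. With intra-bin normalization alone (`inter_weight=0.0`) the same sample receives a positive advantage, which is the kind of balanced credit assignment the abstract describes.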

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, Xiaodan Liang • 2025

Related benchmarks

Task                              Dataset            Result          Rank
Visual Grounding                  RefCOCO+ (val)     --              253
Visual Grounding                  RefCOCO+ (testA)   --              245
Visual Grounding                  RefCOCO+ (testB)   --              219
Visual Grounding                  RefCOCO (val)      --              172
Visual Grounding                  RefCOCO (testA)    --              162
Visual Grounding                  RefCOCO (testB)    --              159
Visual Grounding                  RefCOCOg (val)     --              158
Visual Grounding                  RefCOCOg (test)    --              155
High-Resolution Visual Reasoning  HR-Bench-8K        Accuracy 71.1   28
High-Resolution Visual Reasoning  HR-Bench-4K        Accuracy 75     15
Showing 10 of 11 rows
