GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

About

Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets regarding target quantity and image relation. To handle diverse multi-image grounding tasks robustly, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, exploiting their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
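
The abstract does not spell out the rule-based reward, so the following is only an illustrative sketch of what a rule-based reward for multi-target grounding could look like, not the authors' implementation. It assumes boxes in (x1, y1, x2, y2) format, an F1-style score under greedy IoU matching, and group-relative reward normalization of the kind used in GRPO-style (R1-like) training. The function names `grounding_reward` and `group_relative_advantages`, and the 0.5 IoU threshold, are hypothetical choices made for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: List[Box], gt: List[Box], thr: float = 0.5) -> float:
    """Hypothetical rule-based reward: F1 of predictions against targets,
    using greedy one-to-one matching at IoU >= thr."""
    if not pred and not gt:
        return 1.0  # correctly predicting "no target"
    if not pred or not gt:
        return 0.0
    unmatched = list(gt)
    true_pos = 0
    for p in pred:
        if not unmatched:
            break
        best = max(unmatched, key=lambda g: iou(p, g))
        if iou(p, best) >= thr:
            unmatched.remove(best)  # each target can be matched only once
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Under these assumptions, a response that localizes one of two targets exactly and predicts nothing else has precision 1 and recall 0.5, so it receives a reward of about 0.667. Scoring with F1 rather than a single-box IoU is one way to penalize both missed targets and spurious boxes, which matters once a query can refer to several objects across images.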

Shurong Zheng, Yousong Zhu, Hongyin Zhao, Fan Yang, Yufei Zhan, Ming Tang, Jinqiao Wang • 2026

Related benchmarks

Task                         Dataset            Metric             Result   Rank
Object Detection             ODinW-13           AP                 41.1     98
Multi-image Understanding    MMIU               Accuracy           55.01    60
Multi-image Reasoning        MuirBench          Accuracy           58.2     48
Multi-image Understanding    BLINK (val)        -                  -        23
Multi-image Understanding    MIBench            Accuracy           70.16    11
Multi-image Grounding        MIG-Bench (test)   Static Score       76.89    10
Multi-image Grounding        MC-Bench (test)    AP50 (Referring)   42       5
Single image Grounding       OdinW              mAP                41.1     5
Single image Grounding       LLMSeg             mIoU               50.9     4
Video Grounding              ReasonVOS          mIoU               0.6441   4
