Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

About

The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.

You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | ODinW-13 | AP 21.9 | 98 |
| Multi-image Understanding | MMIU | Accuracy 54.89 | 60 |
| Multi-image Reasoning | MuirBench | Accuracy 57.81 | 48 |
| Referring Expression Comprehension | RefBench PRO | Acc (Phrase) 52.3 | 30 |
| Multi-image Understanding | BLINK (val) | -- | 23 |
| Grounding | RGBX-Grounding | LasHeR 9.84 | 15 |
| Multi-image Understanding | MIBench | Accuracy 71.42 | 11 |
| Multi-image Grounding | MIG-Bench (test) | Static Score 70.64 | 10 |
| Multi-image Grounding | MC-Bench (test) | AP50 (Referring) 20.3 | 5 |
| Single image Grounding | OdinW | mAP 21.9 | 5 |
Showing 10 of 13 rows
