Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
About
The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | ODinW-13 | AP 21.9 | 98 |
| Multi-image Understanding | MMIU | Accuracy 54.89 | 60 |
| Multi-image Reasoning | MuirBench | Accuracy 57.81 | 48 |
| Referring Expression Comprehension | RefBench PRO | Acc (Phrase) 52.3 | 30 |
| Multi-image Understanding | BLINK (val) | -- | 23 |
| Grounding | RGBX-Grounding | LasHeR 9.84 | 15 |
| Multi-image Understanding | MIBench | Accuracy 71.42 | 11 |
| Multi-image Grounding | MIG-Bench (test) | Static Score 70.64 | 10 |
| Multi-image Grounding | MC-Bench (test) | AP50 (Referring) 20.3 | 5 |
| Single image Grounding | OdinW | mAP 21.9 | 5 |