# Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

## About
Multimodal Large Language Models (MLLMs) perform well in single-image visual grounding but struggle with real-world tasks that demand cross-image reasoning and multi-modal instructions. To address this, we adopt a reinforcement learning (RL) based post-training strategy for MLLMs in multi-image grounding tasks. We first synthesize high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). Subsequently, we apply rejection sampling with the merged SFT model to curate reliable RL data and use rule-based RL to guide the model toward optimal reasoning paths. Extensive experiments demonstrate the effectiveness of our approach, achieving +9.04% on MIG-Bench and +4.41% on average across seven out-of-domain benchmarks.
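The rule-based RL stage scores each sampled completion with simple verifiable rewards rather than a learned reward model. The exact reward design is not spelled out above, but a common recipe for grounding tasks combines a format reward (the response must contain an explicit reasoning trace) with an accuracy reward (the predicted box must overlap the ground truth). A minimal sketch, assuming `<think>…</think>` reasoning tags, `[x1, y1, x2, y2]` box output, and an IoU threshold of 0.5 (all illustrative choices, not necessarily the paper's):

```python
import re

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def rule_based_reward(completion, gt_box, iou_threshold=0.5):
    """Format reward (+1) if a <think>...</think> trace is present,
    accuracy reward (+1) if the predicted box clears the IoU threshold."""
    fmt = 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0
    m = re.search(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]", completion)
    acc = 0.0
    if m:
        pred = [float(g) for g in m.groups()]
        acc = 1.0 if iou(pred, gt_box) >= iou_threshold else 0.0
    return fmt + acc
```

The same reward can also serve the rejection-sampling step: completions from the merged SFT model that fail the accuracy check are simply discarded before RL training.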
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-image Understanding | MMIU | Accuracy | 49.93 | 65 |
| Multi-image Grounding | MIG-Bench (test) | Static Score | 51.52 | 21 |
| Visual Understanding | BLINK sub-tasks | Jigsaw Accuracy | 74.67 | 14 |
| Object Detection | OdinW | mAP (mean) | 54.15 | 7 |
| Multi-image Understanding | BLINK | Accuracy | 61.18 | 5 |
| Image Grounding | Ref-L4 | -- | -- | 4 |
| Multi-image Grounding | ReVOS | Score | 49.16 | 3 |
| Multi-image Grounding | MC-Bench | Accuracy | 58.49 | 3 |
| Single-image Grounding | LISA | cIoU | 59.44 | 3 |