
Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

About

Multimodal Large Language Models (MLLMs) perform well in single-image visual grounding but struggle with real-world tasks that demand cross-image reasoning and multi-modal instructions. To address this, we adopt a reinforcement learning (RL)-based post-training strategy for MLLMs on multi-image grounding tasks. We first synthesize high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). Subsequently, we apply rejection sampling with the merged SFT model to curate reliable RL data and use rule-based RL to guide the model toward optimal reasoning paths. Extensive experiments demonstrate the effectiveness of our approach, achieving gains of +9.04% on MIG-Bench and +4.41% on average across seven out-of-domain benchmarks.
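The rule-based RL stage scores each sampled response with a verifiable reward rather than a learned reward model. A minimal sketch of such a reward for a grounding task is shown below; the `<answer>` tag format, the bounding-box encoding, and the format-bonus weight are illustrative assumptions, not the paper's exact reward definition:

```python
import re


def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(response, gt_box, fmt_weight=0.5):
    """Reward = format bonus (the answer is wrapped in <answer> tags and
    parses as a 4-number box) + accuracy term (IoU with the ground truth).
    The tag names and the 0.5 weight are hypothetical choices."""
    match = re.search(r"<answer>\[([\d.,\s]+)\]</answer>", response)
    if not match:
        return 0.0  # unparseable output earns no reward
    pred = [float(v) for v in match.group(1).split(",")]
    if len(pred) != 4:
        return 0.0
    return fmt_weight + iou(pred, gt_box)
```

In GRPO-style training, such a scalar reward is computed for every rollout in a group and the normalized advantages drive the policy update; responses that are well formatted but poorly localized still receive a partial signal from the IoU term.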

Bob Zhang, Haoran Li, Tao Zhang, Jianan Li, Cilin Yan, Xikai Liu, Jiayin Cai, Yanbin Hao • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-image Understanding | MMIU | Accuracy | 49.93 | 65 |
| Multi-image Grounding | MIG-Bench (test) | Static Score | 51.52 | 21 |
| Visual Understanding | BLINK sub-tasks | Jigsaw Accuracy | 74.67 | 14 |
| Object Detection | OdinW | mAP (Mean) | 54.15 | 7 |
| Multi-image Understanding | BLINK | Accuracy | 61.18 | 5 |
| Image Grounding | Ref-L4 | -- | -- | 4 |
| Multi-image Grounding | ReVOS | Score | 49.16 | 3 |
| Multi-image Grounding | MC-Bench | Accuracy | 58.49 | 3 |
| Single-image Grounding | LISA | cIoU | 59.44 | 3 |
