# Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

## About
Multimodal Large Language Models (MLLMs) perform well in single-image visual grounding but struggle with real-world tasks that demand cross-image reasoning and multi-modal instructions. To address this, we adopt a reinforcement learning (RL) based post-training strategy for MLLMs in multi-image grounding tasks. We first synthesize high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). Subsequently, we apply rejection sampling with the merged SFT model to curate reliable RL data and use rule-based RL to guide the model toward optimal reasoning paths. Extensive experiments demonstrate the effectiveness of our approach, achieving +9.04% on MIG-Bench and +4.41% on average across seven out-of-domain benchmarks.
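The rule-based RL stage scores each sampled completion with simple verifiable rewards rather than a learned reward model. The exact reward design is not spelled out above, but a common recipe for grounding tasks combines a format reward (the response must contain an explicit reasoning trace) with an accuracy reward (the predicted box must overlap the ground truth). A minimal sketch, assuming `<think>…</think>` reasoning tags, `[x1, y1, x2, y2]` box output, and an IoU threshold of 0.5 (all illustrative choices, not necessarily the paper's):

```python
import re

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def rule_based_reward(completion, gt_box, iou_threshold=0.5):
    """Format reward (+1) if a <think>...</think> trace is present,
    accuracy reward (+1) if the predicted box clears the IoU threshold."""
    fmt = 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0
    m = re.search(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]", completion)
    acc = 0.0
    if m:
        pred = [float(g) for g in m.groups()]
        acc = 1.0 if iou(pred, gt_box) >= iou_threshold else 0.0
    return fmt + acc
```

The same reward can also serve the rejection-sampling step: completions from the merged SFT model that fail the accuracy check are simply discarded before RL training.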
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-image Understanding | MMIU | Accuracy | 49.93 | 65 |
| Multi-image Grounding | MIG-Bench (test) | Static Score | 51.52 | 21 |
| Visual Understanding | BLINK sub-tasks | Jigsaw Accuracy | 74.67 | 14 |
| Object Detection | OdinW | mAP (mean) | 54.15 | 7 |
| Multi-image Understanding | BLINK | Accuracy | 61.18 | 5 |
| Image Grounding | Ref-L4 | -- | -- | 4 |
| Multi-image Grounding | ReVOS | Score | 49.16 | 3 |
| Multi-image Grounding | MC-Bench | Accuracy | 58.49 | 3 |
| Single-image Grounding | LISA | cIoU | 59.44 | 3 |