Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

About

While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer which explicitly modeling the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.

Jianghao Yin, Qingbin Li, Kun Sun, Cheng Ding, Jie Wang, Qin Chen, Jie Zhou, Nan Wang, Changqing Li, Pei Wu, Jian Xu, Zheming Yang, Liang He• 2026

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy67.1	563
Mathematical Multimodal Reasoning	MathVerse	Accuracy49.4	259
Mathematical Multimodal Reasoning	MathVista	Accuracy70.1	258
Video Understanding	VideoMME	--	222
Multimodal Reasoning	MathVision	Accuracy26.7	162
Multimodal Reasoning	MMMU-Pro	Accuracy41	146
Multi-image Reasoning	MuirBench	Accuracy71.6	89
Multi-image Reasoning	MIRB	Accuracy55.7	70
Video Reasoning	Video-MMMU	Accuracy51.6	68
Multi-image Understanding	MMIU	Accuracy53.3	65

Showing 10 of 17 rows

Other info

Follow for update

@wizwand_team Discord