
Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

About

While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges, including complex inter-relationships between images and critical information scattered across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions — Global, Focus, Hint, Think, and Answer — which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with a Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning, video understanding, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.
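To make the five-step decomposition concrete, here is a minimal sketch of how a meta-action trajectory might be represented and checked. The action names (Global, Focus, Hint, Think, Answer) come from the abstract; everything else — the `Step` container, the assumption that actions appear in that fixed order exactly once through `Answer` being last — is a hypothetical illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class MetaAction(Enum):
    """The five meta-actions named in the CINEMA abstract."""
    GLOBAL = 0  # survey the whole image set for overall context
    FOCUS = 1   # attend to specific images or regions
    HINT = 2    # surface scattered cues relevant to the question
    THINK = 3   # reason over the gathered evidence
    ANSWER = 4  # emit the final answer


@dataclass
class Step:
    action: MetaAction
    content: str  # the text the model produces for this step


def validate_trajectory(steps: list[Step]) -> bool:
    """Assumed constraint: actions follow the Global -> ... -> Answer
    order (non-decreasing over the enum), ending with Answer."""
    if not steps or steps[-1].action is not MetaAction.ANSWER:
        return False
    indices = [s.action.value for s in steps]
    return indices == sorted(indices)
```

A well-formed trajectory such as `[Global, Focus, Think, Answer]` passes this check, while one that reasons after answering does not.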

Jianghao Yin, Qingbin Li, Kun Sun, Cheng Ding, Jie Wang, Qin Chen, Jie Zhou, Nan Wang, Changqing Li, Pei Wu, Jian Xu, Zheming Yang, Liang He • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 67.1 | 247 |
| Video Understanding | VideoMME | -- | -- | 192 |
| Multi-image Understanding | MMIU | Accuracy | 53.3 | 60 |
| Multi-image Reasoning | MIRB | Accuracy | 55.7 | 60 |
| Multimodal Reasoning | MMMU-Pro | Accuracy | 41 | 55 |
| Multi-image Reasoning | MuirBench | Accuracy | 71.6 | 48 |
| Mathematical Multimodal Reasoning | MathVista | Accuracy | 70.1 | 46 |
| Video Reasoning | Video-MMMU | Accuracy | 51.6 | 32 |
| Multimodal Reasoning | M3CoT (test) | Total Acc | 63.9 | 31 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 49.4 | 29 |

Showing 10 of 17 rows
