Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

About

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Spatial Reasoning (Multi-Image)	ERQA	Accuracy	21.6	23
Spatial Perception	CV-Bench-3D	Accuracy	81.7	14
Multi-frame Spatial Understanding	MultiSPA	Average Score	56.11	7
Multimodal Spatial Reasoning	BLINK	Average Accuracy	84.3	7
Semantic active perception	ActiveViewPose-200K (val)	Success Rate	72.8	4
Semantic active perception	ActiveViewPose-200K (Test1)	Success Rate	74.3	4
Semantic active perception	ActiveViewPose-200K (test2)	Success Rate	63.6	4
Ego-centric Spatial Reasoning	ERQA	Accuracy	36.2	2
Camera Vector Prediction	MultiSPA	Accuracy	82	2
Depth Comparison	MultiSPA	Score	76	2

