Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

About

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to exploit the strong structure prior of a feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder (initialized from the backbone of the visual geometry model) to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under a limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
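
The two mechanisms named in the abstract, a connector that merges semantic and spatial features and a frame sampler that favors spatially informative frames, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `fuse_tokens` and `select_frames` are hypothetical, the connector is stood in for by a plain weighted sum, and "spatially informative" is read as greedy coverage of scene cells, where each frame is described by the set of cell IDs it observes.

```python
from typing import List, Sequence, Set


def fuse_tokens(semantic: Sequence[Sequence[float]],
                spatial: Sequence[Sequence[float]],
                weight: float = 0.5) -> List[List[float]]:
    """Hypothetical connector: per-token weighted sum of two aligned feature sets.

    This is a stand-in, not the paper's connector, which is not specified here.
    """
    if len(semantic) != len(spatial):
        raise ValueError("both encoders must yield the same number of tokens")
    fused = []
    for sem_tok, spa_tok in zip(semantic, spatial):
        if len(sem_tok) != len(spa_tok):
            raise ValueError("token dimensions must match")
        fused.append([weight * s + (1.0 - weight) * g
                      for s, g in zip(sem_tok, spa_tok)])
    return fused


def select_frames(coverage: Sequence[Set[int]], budget: int) -> List[int]:
    """Assumed sampler: greedily pick frames that add the most unseen scene cells.

    coverage[i] is the set of (hypothetical) scene-cell IDs visible in frame i.
    Ties are broken in favor of the lowest frame index.
    """
    chosen: List[int] = []
    seen: Set[int] = set()
    remaining = set(range(len(coverage)))
    while remaining and len(chosen) < budget:
        best = max(sorted(remaining), key=lambda i: len(coverage[i] - seen))
        if not coverage[best] - seen:
            break  # no remaining frame contributes a new cell
        chosen.append(best)
        seen |= coverage[best]
        remaining.remove(best)
    return sorted(chosen)  # keep temporal order for the downstream model


if __name__ == "__main__":
    frames = [{1, 2, 3}, {2, 3}, {4, 5}, {1}]
    print(select_frames(frames, budget=2))
    print(fuse_tokens([[1.0, 3.0]], [[3.0, 5.0]]))
```

Under a budget of two frames, the sampler in this sketch skips frames whose cells are already covered, which mirrors the abstract's stated goal of spending a limited token budget on frames that matter for spatial reasoning.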

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan • 2025

Related benchmarks

Task	Dataset	Result
3D Question Answering	ScanQA (val)	CIDEr 91.8	391
Spatial Reasoning	VSI-Bench	R.Dr. 48.2	370
3D Question Answering	SQA3D (test)	EM@1 55.9	197
3D Visual Grounding	ScanRefer	Acc@0.25 49.3	172
Spatial Reasoning	EmbSpatial	Overall Accuracy 50	131
Spatial Reasoning	Viewspatial	Accuracy 43.6	129
3D Dense Captioning	Scan2Cap	--	127
Spatial Reasoning	VSI-Bench 1.0 (test)	Average Score 48.4	101
Spatial Reasoning	MindCube	Accuracy 26.1	91
3D Question Answering	VSI-Bench	Average Score 48.4	88

Showing 10 of 117 rows
