
MindCube: Spatial Mental Modeling from Limited Views

About

Can Vision-Language Models (VLMs) imagine the full scene from just a few views, as humans do? Humans naturally form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our MindCube benchmark, with 21,154 questions across 3,268 images, shows that current VLMs fall well short of this ability: existing models exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models by representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help approximate spatial mental models in VLMs: incorporating unseen intermediate views, natural language reasoning chains, and cognitive maps. The most significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason over it. Training models to reason over these internal maps boosted accuracy from 37.8% to 57.8% (+20.0 percentage points). Adding reinforcement learning pushed performance further, to 61.3% (+23.5 percentage points over the baseline). Our key insight is that such scaffolding of spatial mental models, actively constructing and using internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
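The gains quoted in the abstract are absolute differences in accuracy, not relative improvements. As a quick check, the short Python sketch below recomputes both from the three accuracy figures stated above. The variable and function names are illustrative only and do not come from the paper.

```python
# Accuracy figures quoted in the abstract, in percent.
baseline = 37.8         # before map-then-reason training
map_then_reason = 57.8  # after training the model to reason over cognitive maps
with_rl = 61.3          # after additionally applying reinforcement learning


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


# Both gains are measured against the same 37.8% baseline.
sft_abs, sft_rel = gains(baseline, map_then_reason)
rl_abs, rl_rel = gains(baseline, with_rl)

print(f"map-then-reason: +{sft_abs} points ({sft_rel}% relative)")
print(f"with RL:         +{rl_abs} points ({rl_rel}% relative)")
```

The absolute gains match the parenthesized figures in the abstract. The relative gains are larger because they are measured against the 37.8% starting point.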

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, Manling Li • 2025

Related benchmarks

Task                             Dataset           Metric            Result  Rank
Spatial Reasoning                VSI-Bench         Avg Score         17.2    192
Spatial Reasoning                Viewspatial       Accuracy          24.1    92
Spatial Reasoning                MindCube          Accuracy          51.7    69
Multimodal Spatial Intelligence  EASI (In-Domain)  Average Score     20.6    32
Multiple Choice Answering        VIEW2SPACE v1     Accuracy          30.21   27
Visual Counting                  VIEW2SPACE v1     MAE               4.52    27
Visual Grounding                 VIEW2SPACE v1     mIoU              0.12    27
Multi-view spatial reasoning     MindCube (tiny)   Overall Accuracy  60.76   24
Spatial Reasoning                CV-Bench 2D       Accuracy          43.1    22
Object Reasoning                 OrthoMind-3D      Object Count      63.6    20
Showing 10 of 16 rows
