
The Spatial Blindspot of Vision-Language Models

About

Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blind spot. Current VLMs are typically built on CLIP-style image encoders, i.e., encoders trained with a contrastive language-image pretraining objective, and the standard recipe flattens each image into a 1D sequence of patches, discarding the 2D structure needed for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications that require spatial grounding, such as robotics and embodied AI. To address it, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can improve spatial reasoning on several benchmarks.
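
To make the second idea concrete, here is a minimal sketch (not the paper's implementation) contrasting the usual 1D raster-order positional encoding with a factorized 2D sinusoidal encoding over the patch grid. The grid size (14×14) and embedding dimension (64) are arbitrary assumptions for illustration.

```python
# Minimal sketch: 1D vs. factorized 2D sinusoidal positional encodings for
# ViT-style image patches. Illustrative only; not the authors' code.
import numpy as np

def sincos_1d(positions: np.ndarray, dim: int) -> np.ndarray:
    """Standard sinusoidal encoding over a 1D index (Vaswani et al., 2017)."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * freqs[None, :]                       # (n, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)   # (n, dim)

def sincos_2d(height: int, width: int, dim: int) -> np.ndarray:
    """2D encoding: half the channels encode the row index, half the column,
    so patches in the same row (or column) share part of their code."""
    assert dim % 4 == 0
    rows = np.repeat(np.arange(height), width)   # row index of each patch
    cols = np.tile(np.arange(width), height)     # column index of each patch
    return np.concatenate(
        [sincos_1d(rows, dim // 2), sincos_1d(cols, dim // 2)], axis=-1
    )                                            # (height * width, dim)

# 1D baseline: patches are numbered 0..N-1 in raster order, so the encoding
# cannot distinguish "one row down" from "width patches to the right".
pe_1d = sincos_1d(np.arange(14 * 14), 64)
# 2D variant: row and column are encoded separately, preserving the grid.
pe_2d = sincos_2d(14, 14, 64)
print(pe_1d.shape, pe_2d.shape)  # (196, 64) (196, 64)
```

The design point is that the 2D variant keeps row and column in separate channels, so vertical and horizontal displacements remain distinguishable after the patches are serialized, whereas a flat 1D index conflates them.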

Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna • 2026

Related benchmarks

Task                              | Dataset                        | Metric     | Result | Rank
Visual Question Answering        | GQA                            | Accuracy   | 38.448 | 374
Multimodal Understanding         | MME                            | MME Score  | 0.582  | 158
Multimodal Understanding         | MMMU (val)                     | MMMU Score | 33.4   | 111
Spatial Reasoning                | Visual Spatial Reasoning (VSR) | Accuracy   | 60.311 | 48
Multimodal Understanding         | SEED Bench Img                 | SEEDB Score| 59.5   | 32
Counting                         | TallyQA                        | Accuracy   | 71     | 28
Counting                         | CountBenchQA                   | Accuracy   | 73.9   | 28
Cultural Multimodal Understanding| CCBench                        | Score      | 0.114  | 20
Spatial Understanding            | TopViewRS                      | Accuracy   | 0.371  | 15
Spatial Understanding            | MMVP                           | Accuracy   | 56     | 15

(Showing 10 of 11 rows.)
