
The Spatial Blindspot of Vision-Language Models

About

Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blind spot. Current VLMs are typically built on CLIP-style image encoders, i.e., encoders trained with a contrastive language-image pretraining objective, and the standard recipe flattens each image into a 1D sequence of patches, discarding the 2D structure needed for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications that require spatial grounding, such as robotics and embodied AI. To address it, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can improve spatial reasoning on several benchmarks.
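
To make the second idea concrete, here is a minimal sketch (not the paper's implementation) contrasting the usual 1D raster-order positional encoding with a factorized 2D sinusoidal encoding over the patch grid. The grid size (14×14) and embedding dimension (64) are arbitrary assumptions for illustration.

```python
# Minimal sketch: 1D vs. factorized 2D sinusoidal positional encodings for
# ViT-style image patches. Illustrative only; not the authors' code.
import numpy as np

def sincos_1d(positions: np.ndarray, dim: int) -> np.ndarray:
    """Standard sinusoidal encoding over a 1D index (Vaswani et al., 2017)."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * freqs[None, :]                       # (n, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)   # (n, dim)

def sincos_2d(height: int, width: int, dim: int) -> np.ndarray:
    """2D encoding: half the channels encode the row index, half the column,
    so patches in the same row (or column) share part of their code."""
    assert dim % 4 == 0
    rows = np.repeat(np.arange(height), width)   # row index of each patch
    cols = np.tile(np.arange(width), height)     # column index of each patch
    return np.concatenate(
        [sincos_1d(rows, dim // 2), sincos_1d(cols, dim // 2)], axis=-1
    )                                            # (height * width, dim)

# 1D baseline: patches are numbered 0..N-1 in raster order, so the encoding
# cannot distinguish "one row down" from "width patches to the right".
pe_1d = sincos_1d(np.arange(14 * 14), 64)
# 2D variant: row and column are encoded separately, preserving the grid.
pe_2d = sincos_2d(14, 14, 64)
print(pe_1d.shape, pe_2d.shape)  # (196, 64) (196, 64)
```

The design point is that the 2D variant keeps row and column in separate channels, so vertical and horizontal displacements remain distinguishable after the patches are serialized, whereas a flat 1D index conflates them.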

Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna • 2026

Related benchmarks

Task                              | Dataset                        | Metric     | Result | Rank
Visual Question Answering        | GQA                            | Accuracy   | 38.448 | 374
Multimodal Understanding         | MME                            | MME Score  | 0.582  | 158
Multimodal Understanding         | MMMU (val)                     | MMMU Score | 33.4   | 111
Spatial Reasoning                | Visual Spatial Reasoning (VSR) | Accuracy   | 60.311 | 48
Multimodal Understanding         | SEED Bench Img                 | SEEDB Score| 59.5   | 32
Counting                         | TallyQA                        | Accuracy   | 71     | 28
Counting                         | CountBenchQA                   | Accuracy   | 73.9   | 28
Cultural Multimodal Understanding| CCBench                        | Score      | 0.114  | 20
Spatial Understanding            | TopViewRS                      | Accuracy   | 0.371  | 15
Spatial Understanding            | MMVP                           | Accuracy   | 56     | 15

(Showing 10 of 11 rows.)
