
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data

About

Vision-language models (VLMs) work well in tasks ranging from image captioning to visual question answering (VQA), yet they struggle with spatial reasoning, a key skill for understanding our physical world that humans excel at. We find that spatial relations are generally rare in widely used VL datasets, with only a few being well represented, while most form a long tail of underrepresented relations. This gap leaves VLMs ill-equipped to handle diverse spatial relationships. To bridge it, we construct a synthetic VQA dataset focused on spatial reasoning generated from hyper-detailed image descriptions in Localized Narratives, DOCCI, and PixMo-Cap. Our dataset consists of 455k samples containing 3.4 million QA pairs. Trained on this dataset, our Spatial-Reasoning Enhanced (SpaRE) VLMs show strong improvements on spatial reasoning benchmarks, achieving up to a 49% performance gain on the What's Up benchmark, while maintaining strong results on general tasks. Our work narrows the gap between human and VLM spatial reasoning and makes VLMs more capable in real-world tasks such as robotics and navigation.
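The paper's data pipeline turns hyper-detailed captions into spatial-reasoning QA pairs. The exact prompts and filtering used by SpaRE are not given here, so the sketch below is only illustrative: it assumes a generic text-in/text-out LLM call (`llm_generate`) and a hypothetical prompt, and shows one plausible way to extract spatial QA pairs from a single caption.

```python
import json
from typing import Callable, Dict, List

# Hypothetical prompt; the actual SpaRE instructions are not specified in this page.
PROMPT_TEMPLATE = (
    "You are given a detailed image description.\n"
    "Description: {caption}\n\n"
    "Write question-answer pairs that test spatial relations between objects "
    "(e.g., left of, behind, on top of). Respond with a JSON list of objects, "
    "each with 'question' and 'answer' fields."
)


def spatial_qa_from_caption(
    caption: str,
    llm_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Generate spatial-reasoning QA pairs from one hyper-detailed caption.

    `llm_generate` is any caller-supplied LLM completion function; it is a
    placeholder, not part of the SpaRE codebase.
    """
    raw = llm_generate(PROMPT_TEMPLATE.format(caption=caption))
    try:
        pairs = json.loads(raw)
    except json.JSONDecodeError:
        return []  # skip captions whose model output is not valid JSON
    # Keep only well-formed pairs with non-empty question and answer strings.
    return [
        {"question": p["question"].strip(), "answer": p["answer"].strip()}
        for p in pairs
        if isinstance(p, dict) and p.get("question") and p.get("answer")
    ]
```

Run over a caption corpus such as Localized Narratives, DOCCI, or PixMo-Cap, a generator like this would yield the image-grounded QA pairs used for fine-tuning; deduplication and quality filtering would be applied on top in practice.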

Michael Ogezi, Freda Shi • 2025

Related benchmarks

Task | Dataset | Result | Rank
Text-based Visual Question Answering | TextVQA | Accuracy: 80.5 | 496
Multimodal Understanding | MMBench | Accuracy: 78.6 | 367
Multi-discipline Multimodal Understanding | MMMU | Accuracy: 51 | 266
Visual Question Answering | RealworldQA | Accuracy: 68.8 | 98
Comprehensive Multi-modal Evaluation | MME | Total Score: 145.5 | 73
Spatial Reasoning | Visual Spatial Reasoning (VSR) | Accuracy: 85.4 | 48
3D Spatial Reasoning | 3DSRBench | Accuracy: 57.5 | 23
Visual Spatial Reasoning | What's Up (Split A) | Accuracy: 100 | 20
Visual Spatial Reasoning | What's Up (Split B) | Accuracy: 100 | 20
Object Hallucination Evaluation | HallB (Hallucination-Bench) | Score: 58.2 | 15
