Visual Spatial Tuning

About

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which introduces overhead and usually harms general capabilities. To enhance spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework for cultivating human-like visuospatial abilities in VLMs, from spatial perception to reasoning. We first enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. We then present VST-R, a curated dataset of 135K samples that instructs models to reason in space. We adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning. Without side effects on general capabilities, VST consistently achieves state-of-the-art results on several spatial benchmarks, including $34.8\%$ on MMSI-Bench and $61.2\%$ on VSI-Bench. Vision-Language-Action models can also be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Multimodal Understanding	MMStar	Accuracy	63.1	511
Optical Character Recognition	OCRBench	Score	855	486
Diagram Understanding	AI2D	Accuracy	84.9	377
Spatial Reasoning	VSI-Bench	R.Dr.	55.8	370
Document Visual Question Answering	DocVQA	Accuracy	91.7	203
Spatial Reasoning	EmbSpatial	Overall Accuracy	73.7	131
Spatial Reasoning	Viewspatial	Accuracy	52.8	129
Visual Perception	MMVP	Accuracy	54.7	118
Visual Reasoning	BLINK	Accuracy	62.1	116
Multi-modal Understanding	MMBench EN	Accuracy	83.3	113

Showing 10 of 54 rows
