SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

About

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea is to convert dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate its effectiveness, we apply SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate the performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with a gain of 3.8 mIoU points over the pre-trained DINOv3.
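The abstract does not spell out how spatial information is verbalized, so the following is only a toy sketch of what "converting dense 3D spatial information into linguistic expressions" could look like, assuming a per-pixel depth map is already available. The function names (`describe_relative_depth`, `build_turns`), the input format, and the `tolerance` threshold are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical input format: depth[row][col] is the distance in metres.
Depth = List[List[float]]
Pixel = Tuple[int, int]  # (row, col)


def describe_relative_depth(
    depth: Depth, a: Pixel, b: Pixel, tolerance: float = 0.05
) -> str:
    """Turn the depth values at two pixels into one sentence.

    `tolerance` (metres) is an illustrative threshold below which the
    two points are described as equidistant.
    """
    da = depth[a[0]][a[1]]
    db = depth[b[0]][b[1]]
    if abs(da - db) <= tolerance:
        return f"Point A and point B are at about the same distance ({da:.1f} m)."
    nearer, farther = ("A", "B") if da < db else ("B", "A")
    return (
        f"Point {nearer} is closer to the camera than point {farther} "
        f"by {abs(da - db):.1f} m."
    )


def build_turns(depth: Depth, a: Pixel, b: Pixel) -> List[str]:
    """Order statements from per-point facts to a pairwise relation,
    loosely mirroring a multi-turn, progressively richer prompt."""
    return [
        f"Point A is {depth[a[0]][a[1]]:.1f} m from the camera.",
        f"Point B is {depth[b[0]][b[1]]:.1f} m from the camera.",
        describe_relative_depth(depth, a, b),
    ]


# Toy 2x3 depth map: the left columns are near, the right column is far.
toy_depth = [
    [1.0, 1.2, 4.0],
    [1.1, 1.3, 4.2],
]
turns = build_turns(toy_depth, (0, 0), (1, 2))
for turn in turns:
    print(turn)
```

In this sketch each turn adds one statement, going from per-point depth to a relation between the two points. A real pipeline would need a depth or 3D estimator to produce the map and a far richer vocabulary of relations.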

Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin • 2026

Related benchmarks

Task	Dataset	Result
Semantic Segmentation	ADE20K	mIoU 63.1	699
Visual Question Answering	RealworldQA	Accuracy 79.6	327
3D Semantic Segmentation	ScanNet V2 (val)	mIoU 70.6	230
Monocular Depth Estimation	KITTI	--	220
Semantic Segmentation	Pascal VOC	mIoU 90.9	214
Document Visual Question Answering	DocVQA	Accuracy 97.1	203
Monocular Depth Estimation	NYU V2	--	192
Visual Question Answering	MMMU	Accuracy 76.4	101
Visual Question Answering	OCRBench	Score 909	53
Embodied Visual Question Answering	ERQA	Accuracy 51.5	39

Showing 10 of 17 rows
