SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

About

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks. Code, dataset, and benchmark are released at https://www.anjiecheng.me/SpatialRGPT

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Object Classification | COCO 2017 (val) | Accuracy | 82.9 | 23
Spatial Reasoning | CV-Bench-3D | Accuracy | 63.3 | 21
Egocentric Spatial Reasoning | COCOSPATIAL | Left/Right Accuracy | 84 | 19
Allocentric Spatial Reasoning | COMFORT# | Left/Right Accuracy | 43.08 | 19
Allocentric Spatial Reasoning | 3DSRBench | Left/Right Acc | 36.53 | 19
Perspective-aware spatial reasoning | COMFORT Visual Illusions | Directional Accuracy (Left/Right) | 42.31 | 19
Relative Depth Estimation | BLINK RelativeDepth (test) | Accuracy | 87.9 | 18
Spatial Reasoning | SURDS (test) | Yaw | 1.3 | 12
Spatial Reasoning | SpatialRGPT-Bench qualitative 1.0 (val test) | Below/Above Accuracy | 99.17 | 11
Spatial Reasoning | SpatialRGPT-Bench | Open-ended Score | 92.7 | 9
Showing 10 of 11 rows
