
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors

About

Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond current spatial VQA datasets. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and can extend to help in various downstream robotics tasks such as pick and stack and trajectory planning.

Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, Andrew Markham • 2024

Related benchmarks

Task | Dataset | Result | Rank
Spatial Visual Question Answering | IaOR-VQA Qualitative | Accuracy: 87.3 | 9
Spatial VQA | IaOR-VQA Quantitative | Output Accuracy: 99.5 | 9
Task Diversity Assessment | RL Task Diversity Collections | Self-BLEU: 0.269 | 6
Pick and Stack | ManiSkill-based manipulation dataset (test) | Pick Success Rate: 44 | 3
Visual Question Answering | IaAD-VQA | Accuracy: 84 | 3
Visual Question Answering | IrSD-VQA | Accuracy: 82 | 3
