SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
About
Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond what current spatial VQA datasets capture. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs by prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and can be extended to downstream robotics tasks such as pick-and-stack and trajectory planning.
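As a rough illustration of the zero-shot, training-free idea (a minimal sketch, not the authors' actual pipeline: the depth checkpoint choice is an assumption, and `query_vlm` is a hypothetical placeholder for any VLM API), the snippet below extracts a monocular depth prior with an off-the-shelf foundation model and injects a text summary of it into the VLM prompt:

```python
from PIL import Image
import numpy as np
from transformers import pipeline

def build_spatial_prompt(image_path: str, question: str) -> str:
    """Augment a spatial VQA question with a 3D prior (here: monocular depth)."""
    image = Image.open(image_path)
    # Off-the-shelf depth foundation model; the specific checkpoint is an assumption.
    depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
    depth = np.array(depth_estimator(image)["depth"])
    # Summarize the 3D prior as text so a frozen VLM can consume it zero-shot,
    # with no fine-tuning on spatial VQA data.
    near, far = float(depth.min()), float(depth.max())
    prior = f"Estimated scene depth ranges from {near:.1f} to {far:.1f} (relative units)."
    return f"{prior}\nUsing this 3D context, answer: {question}"

# `query_vlm` stands in for any vision-language model API (hypothetical):
# answer = query_vlm(image_path, build_spatial_prompt(image_path, "Which mug is closer?"))
```

In the full framework, priors from multiple 3D foundation models would be combined in this fashion; the sketch shows only a single depth prior for brevity.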
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Spatial Visual Question Answering | IaOR-VQA Qualitative | Accuracy | 87.3 | 9 |
| Spatial VQA | IaOR-VQA Quantitative | Output Accuracy | 99.5 | 9 |
| Task Diversity Assessment | RL Task Diversity Collections | Self-BLEU | 0.269 | 6 |
| Pick and Stack | ManiSkill-based manipulation dataset (test) | Pick Success Rate | 44 | 3 |
| Visual Question Answering | IaAD-VQA | Accuracy | 84 | 3 |
| Visual Question Answering | IrSD-VQA | Accuracy | 82 | 3 |