End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering
About
We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
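To make the idea of navigation as question-answering concrete, here is a minimal sketch of one decision step. It is not the VLMnav implementation: the prompt wording, the JSON answer format, the action list, and every name (`build_prompt`, `parse_action`, `select_action`, `query_vlm`) are assumptions for illustration. The VLM is passed in as a plain callable so the sketch does not depend on any particular model API.

```python
import re
from typing import Any, Callable, Optional, Sequence


def build_prompt(goal: str, actions: Sequence[str]) -> str:
    """Pose action selection as a multiple-choice question (hypothetical wording)."""
    lines = [
        f"You are a robot navigating indoors. Your goal: {goal}.",
        "Choose exactly one of the numbered actions below.",
    ]
    lines += [f"{i}: {action}" for i, action in enumerate(actions, start=1)]
    lines.append('Answer in the form {"action": <number>}.')
    return "\n".join(lines)


def parse_action(reply: str, num_actions: int) -> Optional[int]:
    """Extract the chosen 1-based action number, or None if missing or out of range."""
    match = re.search(r'"action"\s*:\s*(\d+)', reply)
    if match is None:
        return None
    choice = int(match.group(1))
    return choice if 1 <= choice <= num_actions else None


def select_action(
    query_vlm: Callable[[Any, str], str],  # hypothetical: (image, prompt) -> reply text
    image: Any,
    goal: str,
    actions: Sequence[str],
) -> Optional[str]:
    """One zero-shot decision step: ask the VLM, then map its answer to an action."""
    reply = query_vlm(image, build_prompt(goal, actions))
    choice = parse_action(reply, len(actions))
    return None if choice is None else actions[choice - 1]
```

In a full agent, this step would be repeated on each new observation until the goal is reached or a step budget runs out; a `None` result signals an unparseable answer that the caller has to handle, for example by re-querying.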
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multi-Modal Lifelong Navigation | GOAT-Bench unseen (val) | SR | 20.1 | 22 |
| Object Goal Navigation | HM3D (val) | SR | 50.4 | 21 |
| Object Goal Navigation | HM3D 0.1 | SR | 50.4 | 18 |
| Object Navigation | OVON unseen (val) | SR | 25.5 | 12 |
| Object Navigation | HM3D (val) | SR | 50.4 | 4 |
| Social Navigation | LISN Follow Doctor Arena 3.0 (Scenario a) | Success Rate | 0.00 | 3 |
| Social Navigation | LISN Public Area Arena 3.0 (Scenario c) | Success Rate | 0.55 | 3 |
| Social Navigation | LISN (Go Forklift Carefully) Arena 3.0 (Scenario e) | Success Rate | 16.67 | 3 |
| Social Navigation | LISN Reception Desk Arena 3.0 (Scenario b) | Success Rate | 13.33 | 3 |
| Social Navigation | LISN Go Forklift in Hurry Arena 3.0 (Scenario d) | Success Rate | 33.33 | 3 |