
End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

About

We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
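To illustrate the core idea of casting action selection as question-answering, here is a minimal sketch. All names (`build_prompt`, `parse_action`, `query_vlm`) are hypothetical stand-ins, not the actual VLMnav API; the real prompt format and action space are described in the project code linked above.

```python
# Sketch: navigation as one-step visual question answering.
# The VLM is shown an egocentric image with candidate moves annotated
# as numbered arrows, and is asked to pick one -- no separate
# perception, planning, or control modules.

def build_prompt(goal: str, num_actions: int) -> str:
    """Phrase action selection as a multiple-choice question over
    candidate directions annotated in the image."""
    return (
        f"You are navigating toward: {goal}. The image shows arrows "
        f"labeled 1 to {num_actions} marking possible moves. "
        "Answer with the single number of the best action."
    )

def parse_action(reply: str, num_actions: int) -> int:
    """Extract the chosen action index from free-form text;
    fall back to action 1 if no valid index is found."""
    for token in reply.split():
        token = token.strip(".,!")
        if token.isdigit() and 1 <= int(token) <= num_actions:
            return int(token)
    return 1

# Stub standing in for a real VLM call (e.g., an API client that
# also receives the annotated image).
def query_vlm(prompt: str, image=None) -> str:
    return "Action 3 looks most promising."

prompt = build_prompt("the kitchen", num_actions=5)
action = parse_action(query_vlm(prompt), num_actions=5)
print(action)  # → 3
```

Because the prompt and parser are model-agnostic, the same loop works zero-shot with any chat-style VLM, which is what makes the approach generalizable to downstream navigation tasks.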

Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-Modal Lifelong Navigation | GOAT-Bench unseen (val) | SR | 20.1 | 22 |
| Object Goal Navigation | HM3D (val) | SR | 50.4 | 21 |
| Object Goal Navigation | HM3D 0.1 | SR | 50.4 | 18 |
| Object Navigation | OVON unseen (val) | SR | 25.5 | 12 |
| Object Navigation | HM3D (val) | SR | 50.4 | 4 |
| Social Navigation | LISN Follow Doctor, Arena 3.0 (Scenario a) | Success Rate | 0.00 | 3 |
| Social Navigation | LISN Public Area, Arena 3.0 (Scenario c) | Success Rate | 0.55 | 3 |
| Social Navigation | LISN Go Forklift Carefully, Arena 3.0 (Scenario e) | Success Rate | 16.67 | 3 |
| Social Navigation | LISN Reception Desk, Arena 3.0 (Scenario b) | Success Rate | 13.33 | 3 |
| Social Navigation | LISN Go Forklift in Hurry, Arena 3.0 (Scenario d) | Success Rate | 33.33 | 3 |
