End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

About

We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
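To illustrate what framing action selection as question-answering could look like, here is a minimal sketch. It is not the authors' implementation: the helper names (`build_prompt`, `parse_choice`, `select_action`), the prompt wording, the fallback behavior, and the `query_vlm` callable (standing in for whichever VLM client is used) are all assumptions made for illustration.

```python
import re
from typing import Any, Callable, Optional, Sequence


def build_prompt(goal: str, actions: Sequence[str]) -> str:
    """Phrase the choice of the next action as a multiple-choice question."""
    options = "\n".join(f"{i}: {a}" for i, a in enumerate(actions, start=1))
    return (
        f"You are a robot navigating to: {goal}.\n"
        f"Which action should you take next?\n{options}\n"
        "Answer with the number of one action."
    )


def parse_choice(answer: str, num_actions: int) -> Optional[int]:
    """Return the zero-based index of the first valid option number, or None."""
    for match in re.finditer(r"\d+", answer):
        choice = int(match.group())
        if 1 <= choice <= num_actions:
            return choice - 1
    return None


def select_action(
    query_vlm: Callable[[Any, str], str],  # hypothetical: (image, prompt) -> text
    image: Any,
    goal: str,
    actions: Sequence[str],
    fallback: int = 0,
) -> str:
    """Ask the VLM once and map its free-text answer back to an action."""
    answer = query_vlm(image, build_prompt(goal, actions))
    index = parse_choice(answer, len(actions))
    # Assumption: fall back to a default action when the answer is unparseable.
    return actions[fallback if index is None else index]
```

With a stub in place of a real model, `select_action(lambda image, prompt: "Action 3", None, "a chair", ["move forward", "turn left", "turn right"])` selects the third option. A real deployment would pass the current camera image and a callable that wraps an actual VLM API.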

Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio • 2024

Related benchmarks

Task                                   Dataset                  Result                  Rank
Object Goal Navigation                 HM3D 0.1                 SR 50.4                 35
Multi-Modal Lifelong Navigation        GOAT-Bench unseen (val)  SR 20.1                 22
Object Goal Navigation                 HM3D (val)               SR 50.4                 21
Object Navigation                      HM3D (val)               SR 50.4                 20
Object Navigation                      HM3D v0.1                Success Rate (SR) 50.4  18
Object Navigation                      OVON unseen (val)        SR 25.5                 12
Lifelong Multimodal Object Navigation  GOAT-Bench unseen (val)  s-SR 0.201              10
Language-conditioned navigation        Matterport3D Section A   FGE 0.22                6
Language-conditioned navigation        ScanNet Section B        FGE 0.2                 6
Language-conditioned navigation        ScanNet Section C        FGE 0.19                6
Showing 10 of 22 rows