End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

About

We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
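The idea of letting a VLM select an action in one step by answering a question can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the prompt, the action space, or the model API, so `Action`, `build_prompt`, `parse_choice`, `select_action`, and the `query_vlm` callable are all hypothetical names introduced here. The sketch assumes the VLM is given the current image plus a numbered list of candidate actions and is asked to reply with the number of its choice.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass(frozen=True)
class Action:
    """Hypothetical candidate motion: turn by heading_deg, then move distance_m forward."""
    heading_deg: float
    distance_m: float


def build_prompt(goal: str, actions: Sequence[Action]) -> str:
    """Phrase action selection as a multiple-choice question (assumed prompt format)."""
    lines = [
        f"You are a robot searching for: {goal}.",
        "Candidate actions (number: turn, then move forward):",
    ]
    for i, a in enumerate(actions, start=1):
        lines.append(f"{i}: turn {a.heading_deg:+.0f} deg, move {a.distance_m:.1f} m")
    lines.append('Reply with the number of the best action as JSON, e.g. {"action": 2}.')
    return "\n".join(lines)


def parse_choice(reply: str, num_actions: int) -> Optional[int]:
    """Extract the 1-based action number from the reply; None if missing or out of range."""
    match = re.search(r'"action"\s*:\s*(\d+)', reply)
    if match is None:
        return None
    choice = int(match.group(1))
    return choice if 1 <= choice <= num_actions else None


def select_action(
    goal: str,
    image: Any,
    actions: Sequence[Action],
    query_vlm: Callable[[Any, str], str],  # hypothetical: (image, prompt) -> reply text
) -> Optional[Action]:
    """One VLM call maps the observation directly to an action, with no separate planner."""
    reply = query_vlm(image, build_prompt(goal, actions))
    choice = parse_choice(reply, len(actions))
    return None if choice is None else actions[choice - 1]
```

Because `query_vlm` is passed in as a plain callable, the same loop can be driven by any model client or by a stub during testing. Returning `None` on an unparseable reply leaves the fallback behavior (retry, stop, or a default turn) to the caller.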

Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio • 2024

Related benchmarks

Task	Dataset	Result
Object Goal Navigation	HM3D 0.1	SR 50.4	35
Multi-Modal Lifelong Navigation	GOAT-Bench unseen (val)	SR 20.1	22
Object Goal Navigation	HM3D (val)	SR 50.4	21
Object Navigation	HM3D (val)	SR 50.4	20
Object Navigation	HM3D v0.1	Success Rate (SR) 50.4	18
Object Navigation	OVON unseen (val)	SR 25.5	12
Lifelong Multimodal Object Navigation	GOAT-Bench unseen (val)	s-SR 0.201	10
Language-conditioned navigation	Matterport3D Section A	FGE 0.22	6
Language-conditioned navigation	ScanNet Section B	FGE 0.2	6
Language-conditioned navigation	ScanNet Section C	FGE 0.19	6

Showing 10 of 22 rows
