Topological Planning with Transformers for Vision-and-Language Navigation

About

Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.

Kevin Chen, Junshen K. Chen, Jo Chuang, Marynel Vázquez, Silvio Savarese • 2020

Related benchmarks

Task	Dataset	Metric	Value
Vision-Language Navigation	R2R-CE (val-unseen)	Success Rate (SR)	26.4	677
Vision-and-Language Navigation	R2R (val unseen)	Success Rate (SR)	26.4	448
Vision-and-Language Navigation	R2R-CE (val-seen)	SR	36	79
Vision-and-Language Navigation	R2R-CE v1.0 (val unseen)	SR (Success Rate)	26	44
Vision-Language Navigation	VLN-CE R2R (val unseen)	Navigation Error (NE)	7.9	41
Vision-and-Language Navigation	R2R-CE unseen continuous (val)	SR	26.4	35
Vision-and-Language Navigation	VLN-CE 1.0 (val-seen)	Navigation Error (NE)	6.6	20
Vision-and-Language Navigation	VLN-CE 1.0 (val-unseen)	Navigation Error (NE)	7.9	20
Vision-and-Language Navigation	Room-to-Room (R2R) VLN-CE (val unseen)	Navigation Error (NE)	7.9	17

