
Cross-modal Map Learning for Vision and Language Navigation

About

We consider the problem of Vision-and-Language Navigation (VLN). The majority of current methods for VLN are trained end-to-end, using either unstructured memory such as an LSTM or cross-modal attention over the agent's egocentric observations. In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.
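The core mechanism described above, grounding language in a spatial representation via cross-modal attention, can be illustrated with a minimal sketch. This is not the authors' exact architecture; it only shows the general pattern in which each egocentric map cell (query) attends over the instruction tokens (keys/values) to produce language-informed map features. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(map_feats, text_feats):
    """Illustrative cross-modal attention: map cells attend over language.

    map_feats:  (H*W, D) egocentric map cell features (queries)
    text_feats: (T, D)   instruction token features (keys/values)
    Returns language-informed map features of shape (H*W, D).
    """
    d = map_feats.shape[-1]
    scores = map_feats @ text_feats.T / np.sqrt(d)  # (H*W, T) similarity
    attn = softmax(scores, axis=-1)                 # each cell's distribution over tokens
    return attn @ text_feats                        # (H*W, D) grounded features

# Toy example: a 4x4 map with 8-dim features and a 5-token instruction
rng = np.random.default_rng(0)
grounded = cross_modal_attention(rng.normal(size=(16, 8)),
                                 rng.normal(size=(5, 8)))
print(grounded.shape)  # (16, 8)
```

In the paper's pipeline, features of this kind would feed both the semantic-map prediction for unobserved regions and the waypoint decoder; here the single attention step is shown in isolation.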

Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, Kostas Daniilidis • 2022

Related benchmarks

Task                            Dataset                              Metric                  Result  Rank
Vision-Language Navigation      R2R-CE (val-unseen)                  Success Rate (SR)       34.3    266
Vision-Language Navigation      RxR-CE (val-unseen)                  Success Rate (SR)       14.4    172
Vision-and-Language Navigation  R2R-CE (test-unseen)                 Success Rate (SR)       31      50
Vision-and-Language Navigation  R2R-CE (val-seen)                    Success Rate (SR)       43      49
Vision-and-Language Navigation  VLN-CE 1.0 (val-seen)                Navigation Error (NE)   4.81    20
Vision-and-Language Navigation  VLN-CE 1.0 (val-unseen)              Navigation Error (NE)   6.23    20
Embodied Navigation             R2R-CE                               Navigation Error (NE)   7.02    19
Vision-and-Language Navigation  R2R-CE v1.0 (val-unseen)             Navigation Error (NE)   7.02    19
Vision-and-Language Navigation  VLN-CE (test-unseen)                 Navigation Error (NE)   7.74    17
Vision-Language Navigation      VLN-CE March 8th 2022 (test-unseen)  Trajectory Length (TL)  13.9    6
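The metrics in the table follow standard VLN-CE conventions: Navigation Error (NE) is the distance from the agent's final position to the goal, Trajectory Length (TL) is the total path length, and Success Rate (SR) counts an episode as a success when NE falls within a fixed radius (commonly 3 m in VLN-CE). A minimal sketch of these per-episode definitions, with the function name and radius as assumptions:

```python
import numpy as np

def navigation_metrics(path, goal, success_radius=3.0):
    """Per-episode VLN-style metrics (illustrative, not the official evaluator).

    path: (N, 2) sequence of agent positions in metres
    goal: (2,)   goal position in metres
    """
    path, goal = np.asarray(path, float), np.asarray(goal, float)
    ne = float(np.linalg.norm(path[-1] - goal))                      # Navigation Error
    tl = float(np.linalg.norm(np.diff(path, axis=0), axis=1).sum())  # Trajectory Length
    sr = float(ne <= success_radius)                                 # Success (0 or 1)
    return {"NE": ne, "TL": tl, "SR": sr}

# Example: agent walks from (0,0) to (4,0); the goal is at (5,0)
m = navigation_metrics([(0, 0), (2, 0), (4, 0)], (5, 0))
print(m)  # {'NE': 1.0, 'TL': 4.0, 'SR': 1.0}
```

Benchmark leaderboards report SR averaged over all episodes (as a percentage), which is why the table's SR values are in the 14-43 range while NE and TL are in metres.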
