Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

About

We address a practical yet challenging problem: training robot agents to navigate an environment by following a path described in language instructions. The instructions often contain descriptions of objects in the environment. To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both the spatial locations and the semantic information of the objects in the environment. However, enabling a robot to build a map that represents the environment well is extremely challenging, as the environment often involves diverse objects with various attributes. In this paper, we propose a multi-granularity map, which contains both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively. Moreover, we propose a weakly-supervised auxiliary task, which requires the agent to localize instruction-relevant objects on the map. Through this task, the agent not only learns to localize the instruction-relevant objects for navigation but is also encouraged to learn a better map representation that reveals object information. We then feed the learned map and the instruction to a waypoint predictor to determine the next navigation goal. Experimental results show that our method outperforms the state of the art by 4.0% and 4.6% in success rate in seen and unseen environments, respectively, on the VLN-CE dataset. Code is available at https://github.com/PeihaoChen/WS-MGMap.

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, Chuang Gan • 2022

Related benchmarks

Task                            Dataset                         Metric                 Result  Rank
Vision-Language Navigation      R2R-CE (val-unseen)             Success Rate (SR)      39      433
Vision-and-Language Navigation  R2R (val unseen)                Success Rate (SR)      38.9    344
Vision-Language Navigation      RxR-CE (val-unseen)             Success Rate (SR)      15      280
Vision-and-Language Navigation  R2R-CE (test-unseen)            Success Rate (SR)      35      63
Vision-and-Language Navigation  R2R-CE (val-seen)               Success Rate (SR)      47      49
Vision-and-Language Navigation  R2R-CE unseen continuous (val)  Success Rate (SR)      38.9    35
Vision-Language Navigation      VLN-CE R2R (val unseen)         Navigation Error (NE)  6.28    22
Vision-and-Language Navigation  VLN-CE 1.0 (val-unseen)         Navigation Error (NE)  6.28    20
Vision-and-Language Navigation  VLN-CE 1.0 (val-seen)           Navigation Error (NE)  5.65    20
Embodied Navigation             R2R-CE                          Navigation Error (NE)  6.28    19
Showing 10 of 23 rows
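
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how Success Rate (SR) and Navigation Error (NE) are commonly computed from per-episode results. It is not taken from the authors' code. The 3 m success radius and the function names are assumptions made for illustration; this listing does not state the threshold, and the distances here are simply whatever distance-to-goal the benchmark reports when the agent stops.

```python
from typing import Sequence

# Assumption: an episode counts as a success when the agent stops
# within 3 m of the goal. This threshold is not stated in the listing.
SUCCESS_RADIUS_M = 3.0


def navigation_error(final_distances: Sequence[float]) -> float:
    """Mean distance (in meters) between the stop position and the goal."""
    if not final_distances:
        raise ValueError("need at least one episode")
    return sum(final_distances) / len(final_distances)


def success_rate(final_distances: Sequence[float],
                 radius: float = SUCCESS_RADIUS_M) -> float:
    """Percentage of episodes whose final distance is below the radius."""
    if not final_distances:
        raise ValueError("need at least one episode")
    successes = sum(1 for d in final_distances if d < radius)
    return 100.0 * successes / len(final_distances)


# Hypothetical final distances for four episodes, in meters.
episodes = [1.0, 2.5, 4.0, 6.5]
ne = navigation_error(episodes)
sr = success_rate(episodes)
print(f"NE = {ne:.2f} m, SR = {sr:.1f}%")
```

Lower NE and higher SR are better, which is why the table reports them separately for seen and unseen splits.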
