
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

About

Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which substantially reduces the success rate gap between seen and unseen environments (from 30.7% to 11.7%).

Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang • 2018
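
The abstract reports its headline gain on SPL (Success weighted by Path Length). As a rough illustration, not taken from the paper's code, SPL is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is 1 on success and 0 otherwise, l_i is the shortest-path length to the goal, and p_i is the length of the path the agent actually took. The function name `spl` and its parameter names below are hypothetical, chosen only for this sketch.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Hypothetical helper for illustration; assumes one entry per episode
    in each argument and non-negative lengths.
    """
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(path_lengths):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Failed episodes contribute 0; successful ones are discounted
        # by how much longer the taken path is than the shortest path.
        if success and denom > 0:
            total += shortest / denom
    return total / n


# Three episodes: optimal success, success with a 2x detour, failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 10.0])
print(score)
```

Because failed episodes score zero and detours are penalized, SPL rewards policies that are both successful and efficient, which is why it is reported alongside plain success rate (SR) in the benchmark rows listed on this page.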

Related benchmarks

Task                            Dataset                          Result                 Rank
Vision-and-Language Navigation  R2R (val unseen)                 Success Rate (SR): 43  260
Vision-and-Language Navigation  REVERIE (val unseen)             SPL: 7                 129
Vision-Language Navigation      R2R (test unseen)                SR: 63                 122
Vision-Language Navigation      R2R (val seen)                   Success Rate (SR): 67  120
Vision-Language Navigation      R2R Unseen (test)                SR: 63                 116
Vision-and-Language Navigation  R4R unseen (val)                 Success Rate (SR): 29  52
Vision-and-Language Navigation  Room-to-Room (R2R) Unseen (val)  SR: 43                 52
Object Goal Navigation          HM3D-OVON Seen (val)             SR: 39.2               44
Object Goal Navigation          HM3D-OVON unseen (val)           Success Rate: 18.6     43
Navigation                      REVERIE Unseen (test)            SR: 7.84               43
Showing 10 of 61 rows
