
Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks

About

Vision-Language Navigation (VLN) is a task where agents learn to navigate following natural language instructions. The key to this task is to perceive both the visual scene and natural language sequentially. Conventional approaches exploit the vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have neglected the rich semantic information contained in the environment (such as implicit navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to take advantage of the additional training signals derived from the semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, estimating the navigation progress, predicting the next orientation, and evaluating the trajectory consistency. As a result, these additional training signals help the agent to acquire knowledge of semantic representations in order to reason about its activity and build a thorough perception of the environment. Our experiments indicate that auxiliary reasoning tasks improve both the performance of the main task and the model generalizability by a large margin. Empirically, we demonstrate that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, being the best existing approach on the standard benchmark.
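
The abstract describes a main navigation objective trained together with four self-supervised auxiliary objectives. One common way to combine such signals is a weighted sum of losses, sketched below. This is only an illustration under stated assumptions, not the authors' implementation: the names `AuxLosses`, `progress_target`, and `total_loss`, the equal default weights, and the choice of "fraction of steps completed" as the progress label are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class AuxLosses:
    """Scalar losses for the four auxiliary tasks named in the abstract."""
    explain_actions: float   # explaining the previous actions
    progress: float          # estimating the navigation progress
    next_orientation: float  # predicting the next orientation
    consistency: float       # evaluating the trajectory consistency


def progress_target(step: int, total_steps: int) -> float:
    """Hypothetical self-supervised label: fraction of the trajectory completed.

    The label is derived from the trajectory itself, so no extra annotation
    is needed (assumption: progress is measured in steps).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return step / total_steps


def total_loss(nav_loss: float, aux: AuxLosses,
               weights: tuple = (1.0, 1.0, 1.0, 1.0)) -> float:
    """Main navigation loss plus a weighted sum of the auxiliary losses.

    The weights are assumed hyperparameters, not values from the paper.
    """
    terms = (aux.explain_actions, aux.progress,
             aux.next_orientation, aux.consistency)
    return nav_loss + sum(w * t for w, t in zip(weights, terms))
```

For instance, `total_loss(2.0, AuxLosses(0.5, 0.25, 0.125, 0.125))` adds the four auxiliary terms to the navigation loss with unit weights; lowering a weight reduces the influence of that auxiliary signal on training.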

Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang • 2019

Related benchmarks

Task                            Dataset                           Metric                 Result  Rank
Vision-and-Language Navigation  R2R (val unseen)                  Success Rate (SR)      55      260
Vision-Language Navigation      R2R (test unseen)                 SR                     71      122
Vision-Language Navigation      R2R (val seen)                    Success Rate (SR)      70      120
Vision-Language Navigation      R2R Unseen (test)                 SR                     71      116
Vision-and-Language Navigation  Room-to-Room (R2R) Unseen (val)   SR                     55      52
Vision-and-Language Navigation  Room-to-Room (R2R) Seen (val)     NE (Navigation Error)  3.33    32
Vision-and-Language Navigation  Room-to-Room (R2R) (test unseen)  SR                     55      24
Vision-Language Navigation      R2R unseen v1.0 (val)             SR                     55      24
Vision-Language Navigation      R2R 1 (test unseen)               Success Rate           0.55    18
Instruction Following           R2R Unseen (test)                 Success Rate (SR)      55      11
Showing 10 of 16 rows
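
The two metrics in the listing, Success Rate (SR) and Navigation Error (NE), can be computed from each episode's final distance to the goal. The sketch below assumes the usual R2R convention that an episode succeeds when the agent stops within 3 m of the goal; the function names and the sample distances are hypothetical and are not results from the paper. It also shows why SR may appear either as a fraction or as a percentage, which would explain the listing showing both 0.55 and 55.

```python
def navigation_error(final_distances_m: list) -> float:
    """Mean distance (in meters) between the stopping point and the goal."""
    if not final_distances_m:
        raise ValueError("need at least one episode")
    return sum(final_distances_m) / len(final_distances_m)


def success_rate(final_distances_m: list, threshold_m: float = 3.0) -> float:
    """Fraction of episodes that end within threshold_m of the goal.

    The 3.0 m default is the assumed R2R success threshold.
    """
    if not final_distances_m:
        raise ValueError("need at least one episode")
    successes = sum(1 for d in final_distances_m if d < threshold_m)
    return successes / len(final_distances_m)


# Made-up final distances for four episodes, for illustration only.
distances = [1.0, 2.5, 4.0, 6.5]
sr_fraction = success_rate(distances)    # SR on a 0-1 scale
sr_percent = 100.0 * sr_fraction         # the same SR on a 0-100 scale
ne = navigation_error(distances)         # NE in meters
```

Higher SR is better, while lower NE is better, so the two columns should not be compared on the same scale.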
