Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

About

A grand goal in AI is to build a robot that can accurately navigate based on natural language instructions, which requires the agent to perceive the scene, understand and ground language, and act in the real-world environment. One key challenge here is to learn to navigate in new environments that are unseen during training. Most existing approaches perform dramatically worse in unseen environments than in seen ones. In this paper, we present a generalizable navigational agent. Our agent is trained in two stages. The first stage is training via mixed imitation and reinforcement learning, combining the benefits of both off-policy and on-policy optimization. The second stage is fine-tuning via newly introduced 'unseen' triplets (environment, path, instruction). To generate these unseen triplets, we propose a simple but effective 'environmental dropout' method to mimic unseen environments, which overcomes the problem of limited seen environment variability. Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions. Empirically, we show that our agent generalizes substantially better when fine-tuned with these triplets, outperforming the state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task, and achieving the top rank on the leaderboard.
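
The 'environmental dropout' idea in the abstract can be illustrated with a small sketch. The abstract gives no implementation details, so everything below is an assumption: the function name `environmental_dropout`, the `(num_views, feat_dim)` feature layout, and the choice to sample a single feature-channel mask per environment and reuse it for every view (ordinary feature dropout would instead resample the mask per view). It is not the authors' code, only one plausible reading of the technique.

```python
import numpy as np


def environmental_dropout(features, drop_prob, rng):
    """Apply one shared dropout mask to all views of a single environment.

    features:  (num_views, feat_dim) array of visual features (assumed layout).
    drop_prob: probability of zeroing each feature channel.
    rng:       a numpy.random.Generator.
    Returns the masked features and the boolean keep-mask that was used.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    feat_dim = features.shape[-1]
    # One mask per environment: sampled once over channels, not per view.
    keep = rng.random(feat_dim) >= drop_prob
    # Inverted-dropout scaling keeps the expected feature magnitude unchanged.
    mask = keep.astype(features.dtype) / (1.0 - drop_prob)
    # Broadcasting applies the same channel mask to every view.
    return features * mask, keep


rng = np.random.default_rng(0)
views = np.ones((36, 8), dtype=np.float32)  # hypothetical: 36 views, 8 channels
dropped, keep = environmental_dropout(views, 0.5, rng)
```

Because the mask is shared, a dropped channel is missing from every view, so the modified environment stays internally consistent while differing from the one seen in training. Under the abstract's description, new paths and instructions would then be generated on such modified environments via back-translation.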

Hao Tan, Licheng Yu, Mohit Bansal • 2019

Related benchmarks

Task	Dataset	Metric	Result
Vision-and-Language Navigation	R2R (val unseen)	Success Rate (SR)	52	448
Vision-Language Navigation	R2R (val seen)	Success Rate (SR)	64.4	150
Vision-Language Navigation	R2R (test unseen)	SR	69	149
Vision-Language Navigation	R2R Unseen (test)	SR	69	144
Vision-and-Language Navigation	R2R (val seen)	Success Rate (SR)	62	68
Vision-and-Language Navigation	R4R unseen (val)	Success Rate (SR)	34.7	60
Vision-and-Language Navigation	Room-to-Room (R2R) Unseen (val)	SR	52	52
Vision-and-Language Navigation	R2R (test)	SPL (Success weighted Path Length)	47	51
Vision-Language Navigation	R2R unseen v1.0 (val)	SR	52	48
Vision-and-Language Navigation	Room-to-Room (R2R) Seen (val)	NE (Navigation Error)	3.99	32

Showing 10 of 25 rows
