Airbert: In-domain Pretraining for Vision-and-Language Navigation

About

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using the IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.

Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid • 2021
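
The abstract mentions a shuffling loss that teaches the model the temporal order inside path-instruction pairs, but gives no details. The following is only a minimal sketch of one way such training examples could be built, assuming the ordered path is the positive and reordered copies of it are negatives. The function names `shuffled_negatives` and `build_shuffle_batch`, the room labels, and the labeling scheme are hypothetical illustrations, not the authors' implementation.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_negatives(path: Sequence[str], num_negatives: int, seed: int = 0) -> List[List[str]]:
    """Return up to `num_negatives` distinct reorderings of `path` that differ from the original."""
    rng = random.Random(seed)
    original = list(path)
    negatives: List[List[str]] = []
    # Bounded number of attempts so very short or repetitive paths cannot loop forever.
    for _ in range(20 * num_negatives):
        if len(negatives) == num_negatives:
            break
        candidate = original[:]
        rng.shuffle(candidate)
        if candidate != original and candidate not in negatives:
            negatives.append(candidate)
    return negatives


def build_shuffle_batch(
    path: Sequence[str], instruction: str, num_negatives: int = 2, seed: int = 0
) -> Tuple[List[Tuple[List[str], str]], List[int]]:
    """Pair the instruction with the ordered path (label 1) and with shuffled paths (label 0)."""
    candidates = [list(path)] + shuffled_negatives(path, num_negatives, seed)
    labels = [1] + [0] * (len(candidates) - 1)
    return [(c, instruction) for c in candidates], labels


# Hypothetical example: four image placeholders forming one path.
example_path = ["hallway", "kitchen", "bedroom", "bathroom"]
example_pairs, example_labels = build_shuffle_batch(
    example_path, "Walk through the kitchen and stop in the bathroom."
)
```

A model trained on such a batch would be asked to score the ordered candidate above the shuffled ones. How the scores are computed and which loss is applied are outside the scope of this sketch.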

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) 75.01 | 260 |
| Vision-and-Language Navigation | REVERIE (val unseen) | SPL 21.88 | 129 |
| Vision-Language Navigation | R2R (test unseen) | SR 63 | 122 |
| Vision-Language Navigation | R2R (val seen) | Success Rate (SR) 75 | 120 |
| Vision-Language Navigation | R2R Unseen (test) | SR 77 | 116 |
| Vision-and-Language Navigation | R2R (val seen) | Success Rate (SR) 81.4 | 51 |
| Navigation | REVERIE Unseen (test) | SR 30.28 | 43 |
| Vision-and-Language Navigation | REVERIE Unseen (test) | Success Rate (SR) 30.28 | 40 |
| Navigation | REVERIE (val unseen) | Success Rate (SR) 27.89 | 34 |
| Remote Grounding | REVERIE Unseen (test) | RGS 16.83 | 33 |
Showing 10 of 26 rows
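
The benchmark results report Success Rate (SR) and SPL, which is commonly expanded as Success weighted by Path Length. As a rough illustration of how these two metrics are usually computed over a set of episodes, here is a small sketch. The function names and the sample numbers are made up for illustration, per-episode success flags are assumed to be given, and shortest-path lengths are assumed to be positive. The listed results appear to be percentages, so the fractions below would be multiplied by 100 to compare.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent succeeded."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    taken_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i).

    l_i is the shortest-path length and p_i the length of the path the agent took.
    Assumes every shortest-path length is greater than zero.
    """
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, taken_lengths):
        if success:
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Made-up episodes: three successes (one with a detour) and one failure.
demo_success = [True, True, False, True]
demo_shortest = [10.0, 10.0, 10.0, 5.0]
demo_taken = [10.0, 20.0, 10.0, 5.0]

demo_sr = success_rate(demo_success)
demo_spl = spl(demo_success, demo_shortest, demo_taken)
```

SPL is never larger than SR, because each successful episode contributes at most 1 and is discounted when the agent's path is longer than the shortest one.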

