Scaling Data Generation in Vision-and-Language Navigation

About

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which applies 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.

Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao • 2023

Related benchmarks

Task	Dataset	Metric	Result	
Vision-Language Navigation	R2R-CE (val-unseen)	Success Rate (SR)	55	779
Vision-and-Language Navigation	R2R (val unseen)	Success Rate (SR)	81	476
Vision-and-Language Navigation	REVERIE (val unseen)	SPL	43.5	237
Vision-Language Navigation	R2R Unseen (test)	SR	77	144
Vision-and-Language Navigation	REVERIE Unseen (test)	Success Rate (SR)	56.1	110
Vision-Language Navigation	VLN-CE R2R (val unseen)	Navigation Error (NE)	4.8	76
Vision-and-Language Navigation	R2R (val seen)	Success Rate (SR)	81	68
Vision-and-Language Navigation	R2R-CE (test-unseen)	SR	55	63
Vision-and-Language Navigation	R2R-CE v1.0 (val unseen)	SR (Success Rate)	55	61
Navigation	REVERIE Unseen (test)	SR	56.1	51

