Target-Driven Structured Transformer Planner for Vision-Language Navigation

About

Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, which, however, has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even when it is located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves previous best methods' success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.

Yusheng Zhao, Jinyu Chen, Chen Gao, Wenguan Wang, Lirong Yang, Haibing Ren, Huaxia Xia, Si Liu • 2022

Related benchmarks

Task	Dataset	Result
Vision-and-Language Navigation	R2R (val unseen)	Success Rate (SR): 70	448
Vision-and-Language Navigation	REVERIE (val unseen)	SPL: 27.32	225
Vision-Language Navigation	R2R (test unseen)	SR: 67	149
Vision-Language Navigation	R2R Unseen (test)	SR: 67	144
Vision-and-Language Navigation	REVERIE Unseen (test)	Success Rate (SR): 35.89	110
Vision-and-Language Navigation	R2R (val seen)	Success Rate (SR): 77	68
Vision-Language Navigation	R2R unseen v1.0 (val)	SR: 70	48
Remote Object Grounding	REVERIE (test unseen)	OSR: 40.26	38
Remote Object Grounding	REVERIE (val unseen)	OSR: 39.48	38
Vision-Language Navigation	R2R 1 (test unseen)	Success Rate: 0.67	29
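
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Success Rate (SR) and Success weighted by Path Length (SPL) are commonly computed for navigation episodes. It is not taken from the TD-STP code base. The `Episode` fields, the function names, and the sample numbers are hypothetical, and the 3 m success radius is the convention usually used for R2R (REVERIE defines success differently), so it is exposed as a parameter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    final_dist: float       # distance from the stop position to the goal, in metres
    path_length: float      # length of the path the agent actually travelled
    shortest_length: float  # shortest-path length from start to goal (assumed > 0)


def success_rate(episodes: List[Episode], radius: float = 3.0) -> float:
    """Fraction of episodes that stop within `radius` metres of the goal."""
    successes = sum(1 for e in episodes if e.final_dist < radius)
    return successes / len(episodes)


def spl(episodes: List[Episode], radius: float = 3.0) -> float:
    """Success weighted by (normalised inverse) path length.

    Each successful episode contributes shortest / max(taken, shortest);
    failed episodes contribute 0. The result is averaged over all episodes.
    """
    total = 0.0
    for e in episodes:
        if e.final_dist < radius:
            total += e.shortest_length / max(e.path_length, e.shortest_length)
    return total / len(episodes)


# Made-up episodes: an efficient success, a failure, and a success via a detour.
episodes = [
    Episode(final_dist=1.0, path_length=10.0, shortest_length=10.0),
    Episode(final_dist=5.0, path_length=8.0, shortest_length=8.0),
    Episode(final_dist=2.0, path_length=20.0, shortest_length=10.0),
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

SPL can never exceed SR, because every successful episode is weighted by a factor of at most 1. This is why a benchmark's SPL entry is typically lower than the SR reported on the same split.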

