Episodic Transformer for Vision-and-Language Navigation

About

Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical for solving compositional tasks, and that pretraining and joint training with synthetic instructions further improve performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
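As a rough illustration of what attending over the full episode history can look like, the sketch below builds a boolean attention mask for a single input sequence. It is not taken from the paper's code. The sequence layout (instruction words, then frames, then actions), the causal masking rule, and the name `build_episode_mask` are all assumptions made for this example.

```python
def build_episode_mask(num_words, num_steps):
    """Return allowed[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical, for illustration only):
        [word_0 .. word_{W-1} | frame_0 .. frame_{T-1} | action_0 .. action_{T-1}]
    Assumed rule: words attend to words only; the frame and action slots of
    step t attend to every word and to frames/actions of steps 0..t.
    """
    n = num_words + 2 * num_steps
    allowed = [[False] * n for _ in range(n)]
    frame_start = num_words
    action_start = num_words + num_steps

    # Instruction words see the whole instruction, but no episode history.
    for i in range(num_words):
        for j in range(num_words):
            allowed[i][j] = True

    # Each time step sees the instruction plus the history up to itself.
    for t in range(num_steps):
        for row in (frame_start + t, action_start + t):
            for j in range(num_words):
                allowed[row][j] = True
            for s in range(t + 1):
                allowed[row][frame_start + s] = True
                allowed[row][action_start + s] = True
    return allowed


# Three instruction words and a two-step episode give a 7 x 7 mask.
mask = build_episode_mask(num_words=3, num_steps=2)
```

A mask like this would typically be passed to a transformer encoder so that the prediction at step t cannot look at later observations, while still seeing every earlier frame and action rather than a compressed recurrent state.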

Alexander Pashevich, Cordelia Schmid, Chen Sun • 2021

Related benchmarks

Task	Dataset	Metric	Value	
Instruction Following	ALFRED (test-unseen)	GC	18.56	31
Embodied Task Completion	ALFRED seen (test)	Success Rate (SR)	38.42	26
Embodied Instruction Following	ALFRED seen 1.0 (test)	GC	45.44	20
Mobile Manipulation	ALFRED seen (test)	Success Rate (SR)	38.42	18
Mobile Manipulation	ALFRED (test-unseen)	Success Rate (SR)	8.57	18
Aerial Vision-and-Language Navigation	ANDH Seen 1.0 (val)	SPL	12.1	14
Interactive Planning	ALFRED unseen (val)	Success Rate (SR)	7.32	8
Aerial Vision-and-Language Navigation	ANDH Full Unseen 1.0 (test)	SPL	190	7
Aerial Vision-and-Language Navigation	ANDH Unseen 1.0 (val)	SPL	14.3	7
Aerial Vision-and-Language Navigation	ANDH Unseen 1.0 (test)	SPL	11.3	7

