Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments
About
In the Vision-and-Language Navigation (VLN) task, an embodied agent navigates a 3D environment by following natural language instructions. A key challenge is handling 'off the path' scenarios, where the agent veers from the reference path. Prior work supervises the agent with actions based on the shortest path from the agent's current location to the goal, but such goal-oriented supervision is often misaligned with the instruction. Furthermore, the evaluation metrics used by prior work do not measure how much of the language instruction the agent actually follows. In this work, we propose a simple and effective language-aligned supervision scheme, along with a new metric that counts the sub-instructions the agent completes during navigation.
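The two ideas above can be sketched in a few lines: supervise toward the reference path rather than the goal, and score an episode by how many sub-instruction waypoints were reached in order. This is a minimal illustrative sketch, not the paper's actual implementation; all function names, array shapes, and the distance thresholds are assumptions.

```python
import numpy as np

def law_supervision_target(agent_pos, reference_path):
    """Language-aligned supervision sketch: instead of aiming at the
    final goal, pick a target on the annotated reference path so the
    action oracle steers the agent back onto the instruction-described
    route. `reference_path` is an (N, 2) array of waypoints.
    Illustrative only -- not the paper's API."""
    dists = np.linalg.norm(reference_path - agent_pos, axis=1)
    nearest = int(np.argmin(dists))
    # Aim one waypoint ahead of the nearest one, unless at the path's end.
    target = min(nearest + 1, len(reference_path) - 1)
    return reference_path[target]

def subinstructions_completed(visited, subgoal_waypoints, radius=3.0):
    """Sub-instruction metric sketch: count how many sub-instruction
    waypoints the agent passed within `radius` meters, in order.
    A simple stand-in for the metric described above."""
    done = 0
    for wp in subgoal_waypoints:
        if any(np.linalg.norm(p - wp) <= radius for p in visited):
            done += 1
        else:
            break  # order matters: stop at the first missed sub-goal
    return done
```

For example, an agent standing near the second waypoint of a straight path is supervised toward the third waypoint, keeping it on the described route even when a shortcut to the goal exists.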
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Vision-and-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 35 | 266 |
| Vision-and-Language Navigation | RxR-CE (val-unseen) | Success Rate (SR) | 8 | 172 |
| Vision-and-Language Navigation | R2R-CE (val-seen) | Success Rate (SR) | 37 | 49 |
| Vision-and-Language Navigation | VLN-CE 1.0 (val-unseen) | Navigation Error (NE) | 6.83 | 20 |
| Vision-and-Language Navigation | VLN-CE 1.0 (val-seen) | Navigation Error (NE) | 6.35 | 20 |
| Vision-and-Language Navigation | RxR-Habitat English (val-seen) | Trajectory Length (TL) | 6.27 | 3 |
| Vision-and-Language Navigation | RxR-Habitat English (val-unseen) | Trajectory Length (TL) | 4.01 | 3 |