BEVBert: Multimodal Map Pre-training for Language-guided Navigation
About
Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods learn visual-textual associations from discrete panoramas. This requires the model to implicitly correlate incomplete and duplicate observations within the panoramas, which may impair an agent's spatial understanding. We therefore propose a new map-based, spatially aware pre-training paradigm for VLN. Concretely, we build a local metric map that explicitly aggregates incomplete observations and removes duplicates, while modeling navigation dependencies in a global topological map. This hybrid design balances VLN's demand for both short-term reasoning and long-term planning. Based on the hybrid map, we then devise a pre-training framework that learns a multimodal map representation, which enhances spatially aware cross-modal reasoning and thereby facilitates language-guided navigation. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks.
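To make the hybrid-map idea more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function `build_local_metric_map`, the class `TopoMap`, the grid size, the cell size, and the choice of averaging features per cell are all assumptions made for illustration. It only shows how observations that land in the same grid cell can be merged (one simple way to aggregate partial views and collapse duplicates), and how visited and candidate viewpoints can be linked in a graph.

```python
import numpy as np


def build_local_metric_map(points_xy, feats, grid_size=5, cell_m=1.0):
    """Average the features of observations that fall into the same grid cell.

    Hypothetical helper: points_xy is (N, 2), positions relative to the agent
    in metres; feats is (N, D). Returns a (grid_size, grid_size, D) map and
    the per-cell observation counts.
    """
    points_xy = np.asarray(points_xy, dtype=float)
    feats = np.asarray(feats, dtype=float)
    half = grid_size * cell_m / 2.0
    # Convert metric coordinates to integer cell indices centred on the agent.
    idx = np.floor((points_xy + half) / cell_m).astype(int)
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)

    grid = np.zeros((grid_size, grid_size, feats.shape[1]))
    counts = np.zeros((grid_size, grid_size))
    for (ix, iy), f in zip(idx[inside], feats[inside]):
        grid[iy, ix] += f      # duplicates of one location share a cell
        counts[iy, ix] += 1
    seen = counts > 0
    grid[seen] /= counts[seen][:, None]  # mean feature per observed cell
    return grid, counts


class TopoMap:
    """Hypothetical global topological map: viewpoints as nodes, undirected edges."""

    def __init__(self):
        self.edges = {}

    def add_step(self, current, candidates):
        # Link the current viewpoint to each navigable candidate viewpoint.
        self.edges.setdefault(current, set())
        for c in candidates:
            self.edges[current].add(c)
            self.edges.setdefault(c, set()).add(current)


points = [(0.1, 0.1), (0.2, 0.3), (1.5, 0.0)]
features = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
local_map, local_counts = build_local_metric_map(points, features)

topo = TopoMap()
topo.add_step("v0", ["v1", "v2"])
topo.add_step("v1", ["v3"])
```

In this toy example the first two observations fall into the same cell and are averaged, while the graph records which viewpoints are reachable from which. A learned model would replace the plain averaging with its own aggregation.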
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 59.1 | 266 |
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 75 | 260 |
| Vision-Language Navigation | RxR-CE (val-unseen) | Success Rate (SR) | 64.4 | 172 |
| Vision-and-Language Navigation | REVERIE (val unseen) | SPL (Success weighted Path Length) | 36.4 | 129 |
| Vision-Language Navigation | R2R Unseen (test) | Success Rate (SR) | 73 | 116 |
| Vision-and-Language Navigation | R2R (val seen) | Success Rate (SR) | 81 | 51 |
| Vision-and-Language Navigation | R2R-CE (test-unseen) | Success Rate (SR) | 59 | 50 |
| Vision-and-Language Navigation | R2R-CE (val-seen) | Success Rate (SR) | 70.9 | 49 |
| Vision-and-Language Navigation | REVERIE Unseen (test) | Success Rate (SR) | 52.81 | 40 |
| Vision-and-Language Navigation | R2R (test) | SPL (Success weighted Path Length) | 60 | 38 |
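
For readers unfamiliar with the two metrics in the table, the sketch below computes them from per-episode outcomes using their standard definitions: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest-path length to the longer of the shortest and actual path lengths. The function names and the toy episode data are illustrative assumptions, not taken from any benchmark toolkit. The table values appear to be reported as percentages, whereas this sketch returns fractions in [0, 1].

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (1) rather than failure (0)."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over all episodes.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length to the goal and p is the length actually travelled.
    Failed episodes contribute 0.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        denom = max(p, l)
        if denom > 0:  # guard against degenerate zero-length episodes
            total += s * l / denom
    return total / len(successes)


# Toy data (made up for illustration): three episodes.
succ = [1, 0, 1]
shortest = [10.0, 10.0, 10.0]
taken = [10.0, 20.0, 20.0]

sr_value = success_rate(succ)
spl_value = spl(succ, shortest, taken)
```

Here the second success took a path twice as long as the shortest one, so it counts only half toward SPL even though it counts fully toward SR.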