Vision-and-Language Navigation via Causal Learning
About
In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. Considering both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules, which promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which also proves effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT.
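To make the back-door adjustment idea concrete, here is a minimal, hypothetical sketch (not the GOAT implementation) of how such interventions are commonly approximated at the feature level: the expectation P(Y|do(X)) = Σ_z P(Y|X,z)P(z) is realized by attending from an input feature to a fixed dictionary of observable confounder prototypes, weighting by a dataset-level prior P(z), and fusing the result back into the feature. All names and shapes below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def backdoor_adjust(feature, confounders, priors):
    """Feature-level approximation of P(Y|do(X)) = sum_z P(Y|X,z) P(z).

    feature:     (D,) input feature (e.g. a visual observation embedding)
    confounders: (K, D) dictionary of confounder prototypes (e.g. room/object clusters)
    priors:      (K,) dataset-level prior P(z) over the confounders
    """
    scores = confounders @ feature        # similarity to each confounder entry, (K,)
    attn = softmax(scores)                # soft assignment ~ P(z | X)
    weighted = attn * priors              # reweight by the prior P(z)
    weighted = weighted / weighted.sum()  # renormalize
    context = weighted @ confounders      # intervened confounder context, (D,)
    return feature + context              # simple residual fusion

rng = np.random.default_rng(0)
d, k = 8, 4
x = rng.normal(size=d)                    # a single observation feature
z_dict = rng.normal(size=(k, d))          # assumed confounder dictionary
p_z = np.full(k, 1.0 / k)                 # uniform prior P(z)
y = backdoor_adjust(x, z_dict, p_z)       # y has the same shape as x: (8,)
```

In practice the confounder dictionary would be built offline (e.g. by clustering training-set features), and the fusion step would be a learned attention layer rather than this fixed residual sum.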
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 78 | 260 |
| Vision-and-Language Navigation | REVERIE (val unseen) | SPL | 36.7 | 129 |
| Vision-and-Language Navigation | R2R (test unseen) | Success Rate (SR) | 74.57 | 116 |
| Vision-and-Language Navigation | R2R (val seen) | Success Rate (SR) | 83.74 | 51 |
| Vision-and-Language Navigation | REVERIE (test unseen) | Success Rate (SR) | 57.72 | 40 |
| Vision-and-Language Navigation | REVERIE (val seen) | Success Rate (SR) | 78.64 | 28 |
| Vision-and-Language Navigation | RxR (val unseen) | Success Rate (SR) | 68.2 | 26 |
| Vision-and-Language Navigation | RxR (val seen) | Success Rate (SR) | 74.1 | 21 |