SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation
About
Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex environments remains challenging due to insufficient spatial awareness. In this work, we introduce SPAN-Nav, an end-to-end foundation model designed to infuse embodied navigation with universal 3D spatial awareness using RGB video streams. SPAN-Nav extracts spatial priors across diverse scenes through an occupancy prediction task on extensive indoor and outdoor environments. To mitigate the computational burden, we introduce a compact representation for spatial priors, finding that a single token is sufficient to encapsulate the coarse-grained cues essential for navigation tasks. Furthermore, inspired by the Chain-of-Thought (CoT) mechanism, SPAN-Nav uses this single spatial token to explicitly inject spatial cues into action reasoning within an end-to-end framework. Leveraging multi-task co-training, SPAN-Nav captures task-adaptive cues from generalized spatial priors, enabling robust spatial awareness that generalizes even to tasks lacking explicit spatial supervision. To support comprehensive spatial learning, we present a massive dataset of 4.2 million occupancy annotations that covers both indoor and outdoor scenes across multiple types of navigation tasks. SPAN-Nav achieves state-of-the-art performance across three benchmarks spanning diverse scenarios and varied navigation tasks. Finally, real-world experiments validate the robust generalization and practical reliability of our approach across complex physical scenarios.
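The abstract does not say how the single spatial token is produced. Purely as an illustration, the sketch below shows one common way to compress many feature vectors into a single token: attention pooling with one query vector. The function names `softmax` and `pool_to_single_token`, the query vector, and the toy features are all hypothetical and are not taken from SPAN-Nav.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_to_single_token(features, query):
    """Attention-pool N feature vectors (each of length D) into one D-vector.

    `features` stands in for per-patch spatial features and `query` for a
    learned query vector; both are assumptions made for this sketch.
    """
    dim = len(query)
    # Scaled dot-product score between the query and every feature vector.
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in features
    ]
    weights = softmax(scores)
    # Weighted sum of the feature vectors yields the single pooled token.
    return [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]


# Toy input: two 2-D features and a zero query, so both get equal weight.
token = pool_to_single_token([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In a pipeline like the one the abstract describes, a token of this kind would be passed to the action-reasoning stage alongside the visual and language inputs. How SPAN-Nav actually does this is not stated here.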
Related benchmarks
| Task | Dataset | Success Rate (SR) | Rank |
|---|---|---|---|
| Vision-Language Navigation | R2R-CE (val-unseen) | 66.3 | 433 |
| Vision-Language Navigation | RxR-CE (val-unseen) | 69.7 | 280 |
| Point-Goal navigation | InternScenes Home (test) | 90.9 | 15 |
| Point-Goal navigation | InternVLA-N1 Commercial | 91 | 9 |
| SocialNav | MetaUrban 12K (test) | 92 | 9 |
| SocialNav | MetaUrban 12K (Unseen) | 93 | 9 |
| PointNav | MetaUrban 12K (test) | 94 | 9 |
| PointNav | MetaUrban 12K (Unseen) | 92 | 9 |
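
Every row in the table above reports Success Rate (SR). As a minimal sketch, SR is usually computed as the percentage of episodes in which the agent stops within a threshold distance of the goal. The function name `success_rate` and the 3 m default are assumptions: 3 m is the conventional threshold for R2R-style benchmarks, and the other benchmarks listed may define success differently.

```python
def success_rate(final_distances_m, threshold_m=3.0):
    """Return the percentage of episodes that ended within the threshold.

    `final_distances_m` holds each episode's distance to the goal (in meters)
    at the moment the agent stopped. The 3.0 m default is an assumption based
    on R2R-style evaluation, not a value taken from this page.
    """
    if not final_distances_m:
        raise ValueError("at least one episode is required")
    successes = sum(1 for d in final_distances_m if d < threshold_m)
    return 100.0 * successes / len(final_distances_m)


# Toy example: two of four episodes stop within 3 m of the goal.
example_sr = success_rate([1.0, 2.5, 4.0, 10.0])
```

The distances in the example are made up for illustration and are unrelated to the reported results.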