SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation

About

Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex environments remains challenging due to insufficient spatial awareness. In this work, we introduce SPAN-Nav, an end-to-end foundation model designed to infuse embodied navigation with universal 3D spatial awareness using RGB video streams. SPAN-Nav extracts spatial priors across diverse scenes through an occupancy prediction task on extensive indoor and outdoor environments. To mitigate the computational burden, we introduce a compact representation for spatial priors, finding that a single token is sufficient to encapsulate the coarse-grained cues essential for navigation tasks. Furthermore, inspired by the Chain-of-Thought (CoT) mechanism, SPAN-Nav uses this single spatial token to explicitly inject spatial cues into action reasoning within an end-to-end framework. Leveraging multi-task co-training, SPAN-Nav captures task-adaptive cues from generalized spatial priors, enabling robust spatial awareness to generalize even to tasks lacking explicit spatial supervision. To support comprehensive spatial learning, we present a massive dataset of 4.2 million occupancy annotations that covers both indoor and outdoor scenes across multiple types of navigation tasks. SPAN-Nav achieves state-of-the-art performance across three benchmarks spanning diverse scenarios and varied navigation tasks. Finally, real-world experiments validate the robust generalization and practical reliability of our approach across complex physical scenarios.

Jiahang Liu, Tianyu Xu, Jiawei Chen, Lu Yue, Jiazhao Zhang, Zhiyong Wang, Minghan Li, Qisheng Zhao, Anqi Li, Qi Su, Zhizheng Zhang, He Wang • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Vision-Language Navigation	R2R-CE (val-unseen)	Success Rate (SR)	66.3	779
Vision-Language Navigation	RxR-CE (val-unseen)	Success Rate (SR)	69.7	512
Vision-Language Navigation	VLN-CE R2R (val unseen)	Navigation Error (NE)	4.07	76
Point-Goal navigation	InternVLA-N1 Commercial	Success Rate (SR)	91	20
Point-Goal navigation	InternScenes Home (test)	Success Rate (SR)	90.9	15
SocialNav	MetaUrban 12K (test)	Success Rate (SR)	92	9
SocialNav	MetaUrban 12K (Unseen)	Success Rate (SR)	93	9
PointNav	MetaUrban 12K (test)	Success Rate (SR)	94	9
PointNav	MetaUrban 12K (Unseen)	Success Rate (SR)	92	9
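
For readers unfamiliar with the two metrics in the table, the sketch below shows how Success Rate (SR) and Navigation Error (NE) are commonly computed from per-episode results. It is an illustration only, not the evaluation code behind these numbers. The 3 m success radius follows the usual R2R convention and may differ on the other benchmarks listed. Straight-line distance stands in for the geodesic distance that simulators typically report. The `summarize` helper and the episode dictionary layout are hypothetical.

```python
import math


def navigation_error(final_pos, goal_pos):
    """Distance in meters between where the agent stopped and the goal.

    Assumption: straight-line distance; benchmarks usually use geodesic distance.
    """
    return math.dist(final_pos, goal_pos)


def summarize(episodes, success_radius=3.0):
    """Return SR (percent of episodes ending within the radius) and mean NE.

    Assumption: success_radius=3.0 m is the common R2R threshold, not a value
    stated on this page.
    """
    errors = [navigation_error(e["final"], e["goal"]) for e in episodes]
    successes = sum(err < success_radius for err in errors)
    return {
        "SR": 100.0 * successes / len(errors),
        "NE": sum(errors) / len(errors),
    }


# Two toy episodes: one stops 1 m from the goal, the other 5 m away.
episodes = [
    {"final": (0.0, 0.0), "goal": (1.0, 0.0)},
    {"final": (0.0, 0.0), "goal": (0.0, 5.0)},
]
metrics = summarize(episodes)
print(metrics)
```

Higher is better for SR, while lower is better for NE, which is why the NE row reports a small value alongside the larger SR figures.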
