Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

What Limits Vision-and-Language Navigation ?

About

Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng-wang.github.io/stereonav-public.github.io.

Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu• 2026

Related benchmarks

TaskDatasetResultRank
Vision-Language NavigationR2R-CE (val-unseen)
Success Rate (SR)81.1
677
Vision-Language NavigationRxR-CE (val-unseen)
SR67.5
426
Vision-and-Language NavigationOffice Real-world scenario
Success Rate67
12
Vision-and-Language NavigationReal-world scenario Gym
Success Rate83
12
Vision-and-Language NavigationReal-world scenario Lobby
Success Rate78
12
Vision-and-Language NavigationReal-world scenario Outdoor
Success Rate67
12
Showing 6 of 6 rows

Other info

Follow for update