
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

About

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in the Habitat simulator show that AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 65.4 | 677 |
| Vision-Language Navigation | RxR-CE (val-unseen) | Success Rate (SR) | 67.6 | 426 |
| Vision-and-Language Navigation | Real-world Corridor Simple split | Navigation Error (NE) | 1.86 | 3 |
| Vision-and-Language Navigation | Real-world Corridor Complex | Navigation Error (NE) | 2.31 | 3 |
| Vision-and-Language Navigation | Real-world Home Simple | Navigation Error (NE) | 1.54 | 3 |
| Vision-and-Language Navigation | Real-world Home Complex split | Navigation Error (NE) | 1.93 | 3 |
| Vision-and-Language Navigation | Real-world Office (Simple split) | Navigation Error (NE) | 1.77 | 3 |
| Vision-and-Language Navigation | Real-world Office (Complex split) | Navigation Error (NE) | 2.26 | 3 |
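
For readers unfamiliar with the two metrics listed above, the following is a minimal sketch of how Success Rate (SR) and Navigation Error (NE) are commonly computed in VLN evaluation. It is not taken from the AwareVLN paper or its code. The function names, the use of straight-line (Euclidean) distance, and the 3 m success radius are assumptions for illustration; standard benchmark evaluators typically measure geodesic distance in the simulator, and the real-world splits may use a different protocol.

```python
import math


def navigation_error(final_pos, goal_pos):
    """Distance in meters between where the agent stopped and the goal.

    Assumption: straight-line distance. Benchmark evaluators usually
    use geodesic (shortest walkable path) distance instead.
    """
    return math.dist(final_pos, goal_pos)


def summarize(episodes, success_radius=3.0):
    """Return (mean NE in meters, SR in percent) over a list of episodes.

    Each episode is a (final_pos, goal_pos) pair of coordinate tuples.
    Assumption: an episode counts as a success when the agent stops
    within `success_radius` meters of the goal (3 m is a common choice).
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    errors = [navigation_error(final, goal) for final, goal in episodes]
    mean_ne = sum(errors) / len(errors)
    successes = sum(1 for e in errors if e < success_radius)
    success_rate = 100.0 * successes / len(errors)
    return mean_ne, success_rate


# Hypothetical episodes, not data from the paper.
example_episodes = [
    ((0.0, 0.0), (0.0, 1.0)),  # stops 1 m from the goal: success
    ((0.0, 0.0), (3.0, 4.0)),  # stops 5 m from the goal: failure
    ((1.0, 1.0), (1.0, 1.0)),  # stops on the goal: success
    ((0.0, 0.0), (0.0, 2.0)),  # stops 2 m from the goal: success
]
mean_ne, success_rate = summarize(example_episodes)
print(f"NE = {mean_ne:.2f} m, SR = {success_rate:.1f}%")
```

Under these assumptions, lower NE and higher SR both indicate better navigation, which is how the Result column should be read.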