P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation

About

In Vision-and-Language Navigation (VLN), an agent is required to plan a path to the target specified by the language instruction, using its visual observations. Consequently, prevailing VLN methods primarily focus on building powerful planners through visual-textual alignment. However, these approaches often bypass the imperative of comprehensive scene understanding prior to planning, leaving the agent with insufficient perception or prediction capabilities. Thus, we propose P$^{3}$Nav, a novel end-to-end framework integrating perception, prediction, and planning in a unified pipeline to strengthen the VLN agent's scene understanding and boost navigation success. Specifically, P$^{3}$Nav augments perception by extracting complementary cues from object-level and map-level perspectives. Subsequently, our P$^{3}$Nav predicts waypoints to model the agent's potential future states, endowing the agent with intrinsic awareness of candidate positions during navigation. Conditioned on these future waypoints, P$^{3}$Nav further forecasts semantic map cues, enabling proactive planning and reducing the strict reliance on purely historical context. Integrating these perceptual and predictive cues, a holistic planning module finally carries out the VLN tasks. Extensive experiments demonstrate that our P$^{3}$Nav achieves new state-of-the-art performance on the REVERIE, R2R-CE, and RxR-CE benchmarks.

Tianfu Li, Wenbo Chen, Haoxuan Xu, Xinhu Zheng, Haoang Li • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 62 | 433 |
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 76 | 344 |
| Vision-Language Navigation | RxR-CE (val-unseen) | SR | 58.01 | 280 |
| Vision-and-Language Navigation | REVERIE (val unseen) | SPL | 36.78 | 173 |
| Vision-and-Language Navigation | REVERIE Unseen (test) | Success Rate (SR) | 60.06 | 59 |
| Vision-and-Language Navigation | RxR (Room-Across-Room) unseen (val) | SR (Success Rate) | 69.2 | 32 |
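
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Success Rate (SR) and Success weighted by Path Length (SPL) are commonly computed in VLN evaluation. It is not taken from the P$^{3}$Nav paper or its code; the function names, argument names, and the zero-length guard are assumptions made for illustration. SR is the fraction of episodes that end in success, and SPL weights each success by the ratio of the shortest-path length to the length of the path the agent actually took.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent succeeded."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length of the agent's path.
    Failed episodes contribute 0.
    """
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Assumption: skip degenerate zero-length episodes to avoid dividing by zero.
        if ok and denom > 0:
            total += shortest / denom
    return total / len(successes)


# Hypothetical episode data, for illustration only.
successes = [True, False, True, True]
shortest_lengths = [10.0, 8.0, 5.0, 4.0]
path_lengths = [10.0, 12.0, 10.0, 4.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest_lengths, path_lengths)
```

Both functions return fractions in [0, 1]. The table entries appear to be percentages, so the corresponding values would be these fractions multiplied by 100.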
