A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration
About
Bridging the gap between embodied intelligence and embedded deployment remains a key challenge in intelligent robotic systems, where perception, reasoning, and planning must operate under strict constraints on computation, memory, energy, and real-time execution. In vision-and-language navigation (VLN), existing approaches often face a trade-off between reasoning capability and deployment efficiency on real-world platforms. In this paper, we present a deployable embodied VLN system that achieves both high efficiency and strong high-level reasoning on real-world robots. The system is decomposed into a fast perception-action layer and a deep reasoning layer running asynchronously at different time scales, with a shared memory layer enabling efficient interaction between them. To support long-horizon reasoning, we incrementally construct a compact memory graph and progressively feed decomposed subgraphs into a vision-language model (VLM). Furthermore, we formulate exploration as a Weighted Traveling Repairman Problem (WTRP) by jointly considering reasoning outcomes and the spatial distribution of candidate regions. Extensive experiments in simulation and real-world environments demonstrate improved navigation success and efficiency over existing VLN approaches while maintaining real-time performance on resource-constrained hardware. Code and additional real-world experiments are available at https://github.com/xukuanHIT/HiCo-Nav.
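To make the WTRP formulation concrete, here is a minimal sketch of the objective it names: choose a visiting order over candidate regions that minimizes the weighted sum of arrival times. This is not the paper's implementation. The function names (`wtrp_cost`, `solve_wtrp`), the Euclidean travel cost at unit speed, and the brute-force search are all assumptions made for illustration; the abstract does not say how the weights are derived from the reasoning outcomes or which solver is used.

```python
from itertools import permutations
from math import dist


def wtrp_cost(start, order, points, weights):
    """Weighted sum of arrival times for visiting `points` in `order`.

    Assumption: travel time equals Euclidean distance (unit speed).
    """
    elapsed = 0.0
    cost = 0.0
    pos = start
    for i in order:
        elapsed += dist(pos, points[i])  # arrival time at region i
        cost += weights[i] * elapsed     # higher weight penalizes late arrival
        pos = points[i]
    return cost


def solve_wtrp(start, points, weights):
    """Exhaustive search over visiting orders; only practical for a few regions."""
    best = min(
        permutations(range(len(points))),
        key=lambda order: wtrp_cost(start, order, points, weights),
    )
    return list(best), wtrp_cost(start, best, points, weights)


# Hypothetical example: a nearby low-weight region and a farther high-weight one.
order, cost = solve_wtrp((0, 0), [(1, 0), (-2, 0)], [1, 10])
print(order, cost)
```

In this hypothetical example the farther region is visited first because its weight is larger, which is the behavior that distinguishes a weighted repairman objective from nearest-first exploration. With equal weights the nearer region would be visited first.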
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Goal Navigation | MP3D | Success Rate (SR) | 48.5 | 129 |
| Object Navigation | HM3D | Success Rate (SR) | 61 | 110 |
| Open-set ObjectGoal Navigation | HM3D-OVON unseen (val) | Success Rate (SR) | 52.4 | 49 |
| Text Navigation | TextNav | Success Rate (SR) | 27.8 | 14 |