ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation
About
Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 $\text{km}^2$). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
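
To make the "Flow Matching-based Action Expert" idea more concrete, the following is a minimal sketch of how a flow-matching sampler can turn Gaussian noise into a waypoint trajectory by integrating a velocity field with Euler steps. It is not the ABot-N0 implementation: `toy_velocity`, `sample_trajectory`, the step count, and the 5-waypoint target are all hypothetical. In a real system the velocity would presumably come from a learned network conditioned on features from the Cognitive Brain; here a closed-form velocity toward a known target stands in for it so the example runs on its own.

```python
import numpy as np


def toy_velocity(x_t, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the linear path x_t = (1 - t) * noise + t * target, the conditional
    velocity that carries x_t to a known target is (target - x_t) / (1 - t).
    """
    return (target - x_t) / (1.0 - t)


def sample_trajectory(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so the division above is safe
        x = x + dt * toy_velocity(x, t, target)
    return x


# Hypothetical target: 5 waypoints (x, y) along a straight 2 m segment.
target = np.stack([np.linspace(0.0, 2.0, 5), np.zeros(5)], axis=1)
waypoints = sample_trajectory(target)
print(np.round(waypoints, 3))
```

With this particular toy velocity the final Euler step lands on the target, so the printed waypoints match `target` up to floating-point error. A learned velocity field would only approximate this behavior, and more integration steps generally trade latency for accuracy.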
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 66.4 | 677 |
| Vision-Language Navigation | RxR-CE (val-unseen) | Success Rate (SR) | 69.3 | 426 |
| Object Goal Navigation | HM3D-OVON Seen (val) | Success Rate (SR) | 55.3 | 65 |
| Object Goal Navigation | HM3D-OVON unseen (val) | Success Rate (SR) | 54 | 57 |
| Object Goal Navigation | HM3D-OVON Seen-Synonyms (val) | Success Rate (SR) | 55.4 | 56 |
| Vision-Language Navigation | VLN-CE R2R (val unseen) | Navigation Error (NE) | 3.78 | 41 |
| Person-Following | EVT-Bench Single-Target Tracking (STT) single view | Success Rate (SR) | 86.9 | 9 |
| Person-Following | EVT-Bench single view (Distracted Tracking) | Success Rate (SR) | 66.7 | 9 |
| Person-Following | EVT-Bench Ambiguity Tracking (AT) single view | Success Rate (SR) | 67.3 | 8 |
| Open-Vocabulary Navigation | HM3D OVON | Success Rate (SR) | 54 | 8 |
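
For readers unfamiliar with the two metrics in the table, the sketch below shows one common way Success Rate (SR) and Navigation Error (NE) are computed over a set of episodes. It rests on several assumptions: the 3 m success radius, the use of straight-line rather than geodesic distance, and the four toy episodes are illustrative choices, not the evaluation protocol of any benchmark listed above. Each benchmark defines its own success criterion, which may also require an explicit stop action.

```python
import math

# Assumed success radius in meters; check each benchmark's own protocol.
SUCCESS_RADIUS_M = 3.0


def navigation_error(final_pos, goal_pos):
    """Straight-line distance between where the agent stopped and the goal.

    Benchmarks often use geodesic (shortest-path) distance instead; Euclidean
    distance is a simplification for this sketch.
    """
    return math.dist(final_pos, goal_pos)


def summarize(episodes, radius=SUCCESS_RADIUS_M):
    """Return (SR in percent, mean NE) over a list of episodes."""
    errors = [navigation_error(e["final"], e["goal"]) for e in episodes]
    sr = 100.0 * sum(err <= radius for err in errors) / len(errors)
    ne = sum(errors) / len(errors)
    return sr, ne


# Hypothetical toy episodes (not real benchmark data).
episodes = [
    {"final": (0.0, 0.0), "goal": (0.0, 1.0)},  # 1 m from goal: success
    {"final": (0.0, 0.0), "goal": (3.0, 4.0)},  # 5 m from goal: failure
    {"final": (2.0, 2.0), "goal": (2.0, 2.0)},  # 0 m from goal: success
    {"final": (0.0, 0.0), "goal": (0.0, 6.0)},  # 6 m from goal: failure
]
sr, ne = summarize(episodes)
print(f"SR = {sr:.1f}, NE = {ne:.2f}")  # SR = 50.0, NE = 3.00
```

Higher is better for SR, while lower is better for NE, which is why the two kinds of rows in the table should not be compared directly.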