Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
About
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight Diffusion Transformer policy with multi-modal conditioning, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. Because the two systems are trained separately, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
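The division of labor between the two systems can be illustrated with a minimal sketch of a dual-rate control loop. This is not the authors' implementation: the `Goal` container, the `slow_planner` and `fast_policy` stubs, and the `replan_every` parameter are all hypothetical names chosen for illustration, and the abstract does not state the actual update rates or interfaces. The stubs return dummy values in place of a VLM and a diffusion policy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Goal:
    """Hypothetical System 2 output: a pixel goal plus latent features."""
    pixel: Tuple[int, int]
    latent: List[float]


def slow_planner(step: int) -> Goal:
    """Stand-in for the VLM planner (System 2); returns a dummy goal."""
    return Goal(pixel=(step, step), latent=[float(step)])


def fast_policy(goal: Goal, horizon: int = 4) -> List[Tuple[float, float]]:
    """Stand-in for the diffusion policy (System 1).

    Returns evenly spaced waypoints on a straight line to the pixel goal.
    """
    gx, gy = goal.pixel
    return [(gx * (k + 1) / horizon, gy * (k + 1) / horizon) for k in range(horizon)]


def run_loop(total_steps: int, replan_every: int):
    """Run the fast policy every step and the slow planner every `replan_every` steps."""
    goal = None
    log = []
    for step in range(total_steps):
        if step % replan_every == 0:
            # Slow path: refresh the mid-term goal (assumed to be infrequent).
            goal = slow_planner(step)
        # Fast path: always act on the most recent goal.
        trajectory = fast_policy(goal)
        log.append((step, goal.pixel, trajectory[-1]))
    return log


if __name__ == "__main__":
    for entry in run_loop(total_steps=6, replan_every=3):
        print(entry)
```

The point of the sketch is only that the fast policy keeps producing trajectories between planner updates by reusing the latest goal, which is one plausible reading of "ground slow, move fast".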
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 64.3 | 266 |
| Vision-Language Navigation | RxR-CE (val-unseen) | SR | 61.4 | 172 |
| Robot navigation | DynaNav | Navigation Error | 16.45 | 9 |
| Vision-Language Navigation | R2R VLN-PE (val seen) | Trajectory Length (TL) | 10.65 | 7 |
| Vision-Language Navigation | R2R VLN-PE (val unseen) | Trajectory Length (TL) | 10.09 | 7 |
| Robot navigation | Real-world Navigation Tasks v1 (test) | Success Rate | 50 | 6 |
| Vision-and-Language Navigation | R2R VLN (val unseen) | Navigation Error (NE) | 4.05 | 2 |
| Vision-and-Language Navigation | Social-VLN R2R (val unseen) | Navigation Error (NE) | 5.97 | 2 |