Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

About

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
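The division of labor described above can be pictured as a two-rate control loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `slow_planner`, `fast_policy`, `Goal`, `run_episode`, and the `replan_every` interval are hypothetical names introduced for illustration, and both systems are replaced by trivial stubs (the real System 2 is a VLM and the real System 1 is a Diffusion Transformer).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Goal:
    pixel: Tuple[int, int]  # explicit pixel goal in the current image
    latent: List[float]     # latent features handed to the fast policy


def slow_planner(instruction: str, step: int) -> Goal:
    """Hypothetical stand-in for System 2 (the VLM-based planner)."""
    # A real planner would reason over the image and instruction;
    # this stub only shifts the pixel goal with the step index.
    return Goal(pixel=(100 + step, 200), latent=[0.0, 1.0])


def fast_policy(goal: Goal, n_waypoints: int = 4) -> List[Tuple[float, float]]:
    """Hypothetical stand-in for System 1 (the trajectory policy)."""
    # Straight-line waypoints toward a target derived from the pixel goal.
    # The latent features are ignored in this stub.
    x, y = goal.pixel[0] / 100, goal.pixel[1] / 100
    return [(x * k / n_waypoints, y * k / n_waypoints)
            for k in range(1, n_waypoints + 1)]


def run_episode(instruction: str, total_steps: int, replan_every: int):
    """Call the slow planner occasionally and the fast policy every step."""
    goal = None
    planner_calls = 0
    trajectories = []
    for step in range(total_steps):
        if step % replan_every == 0:  # assumed replanning interval
            goal = slow_planner(instruction, step)
            planner_calls += 1
        trajectories.append(fast_policy(goal))
    return planner_calls, trajectories


calls, trajs = run_episode("go to the kitchen", total_steps=10, replan_every=5)
print(calls, len(trajs))
```

The point of the sketch is only the call pattern: the expensive planner runs a few times per episode, while the cheap policy produces a fresh trajectory at every control step from the most recent goal.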

Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, Xihui Liu • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Vision-Language Navigation	R2R-CE (val-unseen)	Success Rate (SR)	64.3	677
Vision-and-Language Navigation	R2R (val unseen)	Success Rate (SR)	64.3	448
Vision-Language Navigation	RxR-CE (val-unseen)	SR	61.4	426
Vision-Language Navigation	RxR (val-unseen)	Success Rate (SR)	61.4	62
Vision-Language Navigation	VLN-CE R2R (val unseen)	Navigation Error (NE)	4.83	41
Vision-Language Navigation	R2R VLN-PE (val unseen)	Navigation Error (NE)	4.66	18
Vision-and-Language Navigation	HM3D Simulation	SR (B)	58.75	18
Vision-Language Navigation	R2R VLN-PE (val seen)	Navigation Error (NE)	4.13	17
Robot navigation	DynaNav	Navigation Error	16.45	9
Vision-and-Language Navigation	Fetch Robot Real-World (Standard (B))	Success Rate (SR)	40	6

Showing 10 of 15 rows
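The table reports Success Rate (SR) and Navigation Error (NE) without defining them. As a rough sketch of how such metrics are commonly computed, the snippet below treats NE as the mean distance between the agent's final position and the goal, and SR as the percentage of episodes ending within a success radius. The 3 m radius follows the usual R2R convention and is an assumption here, as is the use of straight-line rather than geodesic distance; individual benchmarks in the table may define either differently. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def navigation_error(final_pos: Point, goal_pos: Point) -> float:
    """Straight-line distance from the stopping point to the goal (assumed metric)."""
    return math.dist(final_pos, goal_pos)


def summarize(episodes: List[Tuple[Point, Point]],
              success_radius: float = 3.0) -> Tuple[float, float]:
    """Return (mean NE, SR in percent) over (final_pos, goal_pos) pairs."""
    errors = [navigation_error(final, goal) for final, goal in episodes]
    mean_ne = sum(errors) / len(errors)
    # An episode counts as a success if it ends inside the assumed radius.
    sr = 100.0 * sum(e < success_radius for e in errors) / len(errors)
    return mean_ne, sr


# Two toy episodes: one stops 1 m from the goal, one stops 5 m away.
episodes = [((0.0, 0.0), (0.0, 1.0)), ((0.0, 0.0), (3.0, 4.0))]
print(summarize(episodes))
```

Lower NE and higher SR are better, which is why the two metrics appear side by side for the same datasets.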
