
CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation

About

Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent large vision-language models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models on VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a framework for VLN in continuous environments (VLN-CE) that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models. For obstacle avoidance in simulation, we replace the rule-based controller with a fully learnable point-goal policy. For real-world deployment, we design a LiDAR-based clustering module that generates navigable waypoints and pair it with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results, ranking 1st on the VLN-CE leaderboard and significantly improving success rate (SR) and success weighted by path length (SPL) on the test-unseen set over previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.

Liuyi Wang, Zongtao He, Jinlong Li, Ruihao Xia, Mengxian Hu, Chenpeng Yao, Chengju Liu, Yang Tang, Qijun Chen • 2025
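
The abstract says the UCM adaptively fuses the decisions of the small planner and the large reasoner, but does not spell out the rule. The sketch below illustrates one simple form of uncertainty-aware collaboration: a hard entropy gate that defers to the large model only when the small model's distribution over candidate waypoints is close to uniform. It is an illustrative assumption, not the paper's mechanism. The function names, the threshold `tau`, and the idea of passing the large model's choice as an index are all hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(len(probs))."""
    n = len(probs)
    if n <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def fuse_decision(small_probs, large_choice, tau=0.5):
    """Pick a waypoint index from two models (hypothetical gating rule).

    small_probs  : small planner's probabilities over candidate waypoints
    large_choice : index proposed by the large reasoner, or None if unavailable
    tau          : assumed uncertainty threshold above which we defer
    """
    uncertainty = normalized_entropy(small_probs)
    small_choice = max(range(len(small_probs)), key=small_probs.__getitem__)
    if uncertainty > tau and large_choice is not None:
        return large_choice, uncertainty
    return small_choice, uncertainty


# Confident small model: its own argmax is kept.
confident_choice, _ = fuse_decision([0.9, 0.05, 0.05], large_choice=2)
# Near-uniform small model: the large model's proposal is used instead.
uncertain_choice, _ = fuse_decision([0.34, 0.33, 0.33], large_choice=2)
```

A gate like this keeps the expensive large model out of the loop on easy steps. A soft fusion, which weights both models' scores by their uncertainty, is an equally plausible reading of "adaptively fuses".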

Related benchmarks

Task | Dataset | Result | Rank
Vision-and-Language Navigation | R2R-CE (test-unseen) | SR 66 | 50
Vision-and-Language Navigation | R2R-CE (val-seen) | SR 73 | 49
Vision-and-Language Navigation | R2R-CE v1.0 (val unseen) | NE (Navigation Error) 4.06 | 19
Vision-and-Language Navigation | REVERIE CE (val unseen) | NE 6.82 | 8
Vision-and-Language Navigation | REVERIE-CE (val seen) | NE 5.38 | 5
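
For readers unfamiliar with the metrics, the following is a minimal sketch of how SR, NE, and SPL are conventionally computed for VLN episodes. It follows the standard definitions, with SPL defined as success weighted by the ratio of shortest-path length to the longer of the actual and shortest path lengths. The 3 m success threshold is the value commonly used for R2R and is assumed here. The function names and the toy episode data are illustrative and not taken from the CLASH evaluation code.

```python
def success_rate(final_dists, threshold=3.0):
    """Fraction of episodes that stop within `threshold` metres of the goal."""
    return sum(d < threshold for d in final_dists) / len(final_dists)


def navigation_error(final_dists):
    """Mean distance (metres) between the stop position and the goal."""
    return sum(final_dists) / len(final_dists)


def spl(final_dists, shortest_lens, path_lens, threshold=3.0):
    """Success weighted by Path Length, averaged over episodes."""
    total = 0.0
    for d, shortest, taken in zip(final_dists, shortest_lens, path_lens):
        success = 1.0 if d < threshold else 0.0
        longest = max(taken, shortest)
        # Guard against degenerate zero-length episodes.
        ratio = shortest / longest if longest > 0 else 1.0
        total += success * ratio
    return total / len(final_dists)


# Toy data for four episodes (made up for illustration).
final_dists = [1.0, 5.0, 2.0, 4.0]       # metres from goal at stop
shortest_lens = [10.0, 8.0, 6.0, 12.0]   # geodesic start-to-goal length
path_lens = [12.5, 8.0, 6.0, 20.0]       # length of the path actually taken

sr = success_rate(final_dists)
ne = navigation_error(final_dists)
spl_value = spl(final_dists, shortest_lens, path_lens)
```

Leaderboards usually report SR and SPL as percentages, so a fraction such as 0.66 appears as 66. NE is reported in metres, and lower is better.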
