DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation

About

Vision-and-Language Navigation (VLN) requires an embodied agent to navigate in a complex 3D environment according to natural language instructions. Recent progress in large language models (LLMs) has enabled language-driven navigation with improved interpretability. However, most LLM-based agents still rely on single-shot action decisions, where the model must choose one option from noisy, textualized multi-perspective observations. Due to local mismatches and imperfect intermediate reasoning, such decisions can easily deviate from the correct path, leading to error accumulation and reduced reliability in unseen environments. In this paper, we propose DV-VLN, a new VLN framework that follows a generate-then-verify paradigm. DV-VLN first performs parameter-efficient in-domain adaptation of an open-source LLaMA-2 backbone to produce a structured navigational chain-of-thought, and then verifies candidate actions with two complementary channels: True-False Verification (TFV) and Masked-Entity Verification (MEV). DV-VLN selects actions by aggregating verification successes across multiple samples, yielding interpretable scores for reranking. Experiments on R2R, RxR (English subset), and REVERIE show that DV-VLN consistently improves over direct prediction and sampling-only baselines, achieving competitive performance among language-only VLN agents and promising results compared with several cross-modal systems. Code is available at https://github.com/PlumJun/DV-VLN.

Zijun Li, Shijie Li, Zhenxi Zhang, Bin Li, Shoujun Zhou • 2026
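
The reranking step described in the abstract (aggregating verification successes across multiple samples) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name rerank_by_verification, the (reasoning, action) sample format, the two verifier callables, and the equal weighting of the TFV and MEV channels are all assumptions made for illustration. In practice each verifier would wrap an LLM query, which is left abstract here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical verifier signature: (reasoning_text, candidate_action) -> passed?
Verifier = Callable[[str, str], bool]


def rerank_by_verification(
    samples: List[Tuple[str, str]],
    tfv: Verifier,
    mev: Verifier,
) -> List[Tuple[str, int]]:
    """Score each candidate action by counting verification successes.

    `samples` holds (reasoning, action) pairs drawn from the generator.
    Each sample can contribute up to two points to its action: one if the
    true/false check passes and one if the masked-entity check passes.
    Equal weighting of the two channels is an assumption of this sketch.
    """
    scores: Dict[str, int] = defaultdict(int)
    for reasoning, action in samples:
        scores[action] += int(tfv(reasoning, action))
        scores[action] += int(mev(reasoning, action))
    # Highest score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling rerank_by_verification with a list of sampled (reasoning, action) pairs returns the actions ordered by score, and the first entry is the selected action. Because each score is a plain count of passed checks, it can be inspected directly, which is one way to read the "interpretable scores" mentioned in the abstract.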

Related benchmarks

Task	Dataset	Result
Vision-and-Language Navigation	R2R (val unseen)	Success Rate (SR): 52	448
Vision-and-Language Navigation	R2R (val seen)	Success Rate (SR): 54	68
Vision-and-Language Navigation	RxR English (val unseen)	Success Rate (SR): 29.16	5
