StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

About

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions grounded in language instructions with low latency. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang • 2025
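
The slow-fast idea described in the abstract can be illustrated with a toy sketch. This is not the StreamVLN implementation: the class and function names, the window length, the memory budget, and the voxel-based pruning rule are all assumptions made for illustration, and the abstract does not specify how the 3D-aware pruning actually works. The sketch only shows how a sliding window of recent turns combined with a capped, pruned memory keeps the total context size bounded regardless of stream length.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    feature: float   # placeholder for a visual embedding
    xyz: tuple       # hypothetical 3D position associated with the token


def voxel_prune(tokens, voxel_size=0.5):
    """Keep only the most recent token per voxel.

    Illustrative stand-in for "3D-aware token pruning"; the actual
    StreamVLN criterion is not given in the abstract.
    """
    latest = {}
    for tok in tokens:
        key = tuple(int(c // voxel_size) for c in tok.xyz)
        latest.pop(key, None)   # re-insert so dict order reflects recency
        latest[key] = tok
    return list(latest.values())


class SlowFastContext:
    """Toy context: a fast sliding window plus a slow, bounded memory."""

    def __init__(self, window_turns=2, memory_budget=4):
        self.window = deque()            # fast context: recent turns, unpruned
        self.window_turns = window_turns
        self.memory = []                 # slow context: pruned older tokens
        self.memory_budget = memory_budget

    def add_turn(self, tokens):
        self.window.append(tokens)
        if len(self.window) > self.window_turns:
            # The oldest turn leaves the window and is folded into memory.
            evicted = self.window.popleft()
            pruned = voxel_prune(self.memory + evicted)
            # Cap the memory, keeping the most recently updated voxels.
            self.memory = pruned[-self.memory_budget:]

    def context_size(self):
        return len(self.memory) + sum(len(turn) for turn in self.window)
```

With two tokens per turn, `window_turns=2`, and `memory_budget=3`, the context never exceeds 3 + 2 × 2 = 7 tokens no matter how many turns are streamed, which is the bounded-context property the abstract refers to.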

Related benchmarks

Task	Dataset	Metric	Result
Vision-Language Navigation	R2R-CE (val-unseen)	Success Rate (SR)	57	677
Vision-and-Language Navigation	R2R (val unseen)	Success Rate (SR)	56.9	448
Vision-Language Navigation	RxR-CE (val-unseen)	SR	52.9	426
Vision-and-Language Navigation	R2R-CE (val-seen)	SR	62	79
Vision-Language Navigation	RxR (val-unseen)	Success Rate (SR)	56.53	62
Vision-and-Language Navigation	R2R-CE v1.0 (val unseen)	SR (Success Rate)	57	44
Vision-Language Navigation	VLN-CE R2R (val unseen)	Navigation Error (NE)	4.98	41
Vision-and-Language Navigation	R2R-CE unseen continuous (val)	SR	56.9	35
Vision-Language Navigation	HA-VLN Unseen (val)	SR	33	32
Vertical Perception	NavSpace	Navigation Error (NE)	6	30

Showing 10 of 38 rows
