
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

About

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate low-latency actions grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.
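
The slow-fast idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class and parameter names (`SlowFastContext`, `window_size`, `keep_every`, `memory_budget`) are hypothetical, the 3D-aware token pruning is replaced by plain stride subsampling as a stand-in, and KV cache reuse is not modeled. It only shows how a sliding window of recent turns plus a capped, compressed memory keeps the total context bounded as the stream grows.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One dialogue turn: the visual tokens observed and the action emitted."""
    visual_tokens: list
    action: str


@dataclass
class SlowFastContext:
    """Toy slow-fast context. All parameters are illustrative assumptions."""
    window_size: int = 4     # max number of active turns in the fast context
    keep_every: int = 4      # stride of the stand-in pruning (not 3D-aware)
    memory_budget: int = 32  # max number of tokens kept in the slow memory
    fast: deque = field(default_factory=deque)
    slow: list = field(default_factory=list)

    def add_turn(self, turn: Turn) -> None:
        """Append a turn; move turns that leave the window into slow memory."""
        self.fast.append(turn)
        while len(self.fast) > self.window_size:
            evicted = self.fast.popleft()
            # Stand-in compression: keep every k-th visual token.
            self.slow.extend(evicted.visual_tokens[:: self.keep_every])
        # Keep only the most recent tokens so the memory stays bounded.
        if len(self.slow) > self.memory_budget:
            self.slow = self.slow[-self.memory_budget:]

    def context_size(self) -> int:
        """Total number of visual tokens currently held in both contexts."""
        return len(self.slow) + sum(len(t.visual_tokens) for t in self.fast)


if __name__ == "__main__":
    ctx = SlowFastContext(window_size=2, keep_every=4, memory_budget=8)
    for step in range(10):
        tokens = [(step, i) for i in range(16)]
        ctx.add_turn(Turn(visual_tokens=tokens, action="forward"))
    print(len(ctx.fast), len(ctx.slow), ctx.context_size())
```

With a fixed number of tokens per turn, the context size in this sketch never exceeds `window_size * tokens_per_turn + memory_budget`, regardless of how many turns are streamed in.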

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 57 | 677 |
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 56.9 | 448 |
| Vision-Language Navigation | RxR-CE (val-unseen) | Success Rate (SR) | 52.9 | 426 |
| Vision-and-Language Navigation | R2R-CE (val-seen) | Success Rate (SR) | 62 | 79 |
| Vision-Language Navigation | RxR (val-unseen) | Success Rate (SR) | 56.53 | 62 |
| Vision-and-Language Navigation | R2R-CE v1.0 (val unseen) | Success Rate (SR) | 57 | 44 |
| Vision-Language Navigation | VLN-CE R2R (val unseen) | Navigation Error (NE) | 4.98 | 41 |
| Vision-and-Language Navigation | R2R-CE unseen continuous (val) | Success Rate (SR) | 56.9 | 35 |
| Vision-Language Navigation | HA-VLN Unseen (val) | Success Rate (SR) | 33 | 32 |
| Vertical Perception | NavSpace | Navigation Error (NE) | 6 | 30 |
Showing 10 of 38 rows
