StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
About
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLMs often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.
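To make the slow-fast idea concrete, here is a minimal Python sketch of a sliding window of recent dialogue turns paired with a compressed memory of older ones. It is an illustration under stated assumptions, not the StreamVLN implementation: the class `SlowFastContext`, its parameters `window_size` and `keep_ratio`, and the integer token lists are all hypothetical, and the compression step uses plain uniform subsampling as a stand-in for the 3D-aware token pruning described above. KV cache reuse is not modeled.

```python
from collections import deque
from typing import Deque, List


class SlowFastContext:
    """Toy slow-fast context (hypothetical names, not the paper's code)."""

    def __init__(self, window_size: int = 4, keep_ratio: float = 0.25) -> None:
        self.window_size = window_size  # max turns kept in the fast context
        self.keep_ratio = keep_ratio    # fraction of tokens kept per old turn
        self.fast: Deque[List[int]] = deque()  # recent turns, uncompressed
        self.slow: List[List[int]] = []        # older turns, compressed

    def _compress(self, tokens: List[int]) -> List[int]:
        # Assumption: uniform subsampling stands in for 3D-aware pruning.
        keep = max(1, int(len(tokens) * self.keep_ratio))
        stride = max(1, len(tokens) // keep)
        return tokens[::stride][:keep]

    def add_turn(self, tokens: List[int]) -> None:
        # New turns enter the fast window; turns that slide out of the
        # window are compressed and moved to the slow memory.
        self.fast.append(tokens)
        while len(self.fast) > self.window_size:
            evicted = self.fast.popleft()
            self.slow.append(self._compress(evicted))

    def context(self) -> List[int]:
        # Compressed memory first, then the active window, in time order.
        out: List[int] = []
        for chunk in self.slow:
            out.extend(chunk)
        for turn in self.fast:
            out.extend(turn)
        return out


ctx = SlowFastContext(window_size=2, keep_ratio=0.25)
for start in (0, 8, 16):
    ctx.add_turn(list(range(start, start + 8)))
print(len(ctx.context()))
```

In this toy, each turn that leaves the window shrinks to a quarter of its tokens, so the context grows far more slowly than the raw stream. The sketch does not cap the slow memory itself; any such cap would be a further design choice.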
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 57 | 433 |
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 55.74 | 344 |
| Vision-Language Navigation | RxR-CE (val-unseen) | Success Rate (SR) | 52.9 | 280 |
| Vision-and-Language Navigation | R2R-CE (val-seen) | Success Rate (SR) | 62 | 49 |
| Vision-and-Language Navigation | R2R-CE unseen continuous (val) | Success Rate (SR) | 56.9 | 35 |
| Vertical Perception | NavSpace | Navigation Error (NE) | 6 | 30 |
| Precise Movement | NavSpace | Navigation Error (NE) | 5.59 | 27 |
| Vision-Language Navigation | RxR (val-unseen) | Navigation Error (NE) | 5.71 | 25 |
| Vision-Language Navigation | HA-VLN Unseen (val) | Navigation Error (NE) | 5.59 | 23 |
| Embodied Navigation | R2R-CE | Navigation Error (NE) | 4.98 | 19 |