Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

About

Benefiting from advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, understanding long videos remains challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Specifically, we design a Flash Memory module containing a low-capacity context memory, which aggregates long-context temporal information and models the distribution of information density, and a high-capacity augmentation memory, which retrieves detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, namely EgoSchema, MLVU, LVBench, MVBench, and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. Code is available at https://github.com/IVGSZ/Flash-VStream.

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin • 2025
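
The two-tier memory described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). It assumes one plausible reading: the context memory is a fixed number of slots updated by online nearest-centroid clustering, the per-slot frame counts stand in for the "distribution of information density", and the augmentation memory returns detailed features for the densest slots. The class name `TwoTierMemory`, its parameters, and the use of the most recent frame as each slot's representative are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass, field


def _dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class TwoTierMemory:
    """Toy streaming memory; an illustrative assumption, not the paper's module."""
    num_slots: int   # size of the small context memory (hypothetical parameter)
    num_detail: int  # how many detailed entries to retrieve (hypothetical parameter)
    centroids: list = field(default_factory=list)  # running mean of coarse features per slot
    counts: list = field(default_factory=list)     # frames absorbed by each slot
    details: list = field(default_factory=list)    # detailed feature of latest frame per slot

    def add_frame(self, coarse, detail):
        """Absorb one frame in O(num_slots) time, independent of video length."""
        if len(self.centroids) < self.num_slots:
            # Fill empty slots first.
            self.centroids.append(list(coarse))
            self.counts.append(1)
            self.details.append(detail)
            return
        # Assign the frame to its nearest slot and update that slot's running mean.
        k = min(range(self.num_slots), key=lambda i: _dist(coarse, self.centroids[i]))
        self.counts[k] += 1
        n = self.counts[k]
        self.centroids[k] = [c + (x - c) / n for c, x in zip(self.centroids[k], coarse)]
        self.details[k] = detail

    def density(self):
        """Fraction of all frames seen so far that fell into each slot."""
        total = sum(self.counts)
        return [c / total for c in self.counts]

    def retrieve_details(self):
        """Detailed features for the num_detail densest slots."""
        order = sorted(range(len(self.counts)), key=lambda i: -self.counts[i])
        return [self.details[i] for i in order[: self.num_detail]]


# Usage: stream five frames with 1-D coarse features and string stand-ins for details.
memory = TwoTierMemory(num_slots=2, num_detail=1)
for coarse, detail in [([0.0], "f0"), ([10.0], "f1"), ([0.2], "f2"), ([0.1], "f3"), ([9.9], "f4")]:
    memory.add_frame(coarse, detail)
```

The point of the sketch is that storage stays bounded by `num_slots` however long the stream runs, which is the property that makes answering queries in real time feasible. Real systems would operate on visual token features rather than scalars.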

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 65.4	563
Streaming Video Understanding	StreamingBench	Overall 26	259
Long Video Understanding	LVBench	Accuracy 42	218
Video Question Answering	MLVU	Accuracy 66.3	194
Video Question Answering	EgoSchema	Accuracy 68.2	161
Real-Time Visual Understanding	StreamingBench	Overall Score 23.23	134
Video Understanding	MLVU	Accuracy 66.3	114
Video Question Answering	LVBench	Accuracy 42	108
Video Understanding	Video-MME without subtitles	Overall Score 61.2	108
Long Video Understanding	MLVU (dev)	--	63

Showing 10 of 28 rows
