Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

About

Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. In particular, we design a Flash Memory module, containing a low-capacity context memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. Code is available at https://github.com/IVGSZ/Flash-VStream.
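
The two-memory idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (that is in the linked repository). Every name here (`ToyContextMemory`, `select_key_frames`, `sq_dist`) is hypothetical, and the specific choices are assumptions made for illustration: frames are plain feature vectors, the context memory merges its two closest temporally adjacent clusters when it exceeds capacity, cluster size stands in for "information density", and the augmentation step simply picks the frame nearest the centroid of each of the largest clusters.

```python
from dataclasses import dataclass, field


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class Cluster:
    centroid: list
    count: int = 1
    frame_ids: list = field(default_factory=list)


class ToyContextMemory:
    """Fixed-capacity memory over a frame stream (illustrative assumption only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clusters = []

    def add(self, frame_id, feature):
        # Each new frame starts as its own cluster at the end of the timeline.
        self.clusters.append(Cluster(list(feature), 1, [frame_id]))
        if len(self.clusters) > self.capacity:
            self._merge_closest_adjacent()

    def _merge_closest_adjacent(self):
        # Find the pair of temporally adjacent clusters with the closest centroids.
        i = min(
            range(len(self.clusters) - 1),
            key=lambda k: sq_dist(self.clusters[k].centroid,
                                  self.clusters[k + 1].centroid),
        )
        a, b = self.clusters[i], self.clusters[i + 1]
        total = a.count + b.count
        # Count-weighted mean keeps the centroid equal to the mean of all member frames.
        centroid = [(x * a.count + y * b.count) / total
                    for x, y in zip(a.centroid, b.centroid)]
        self.clusters[i:i + 2] = [Cluster(centroid, total, a.frame_ids + b.frame_ids)]

    def density(self):
        # Fraction of seen frames in each cluster: a crude proxy for information density.
        total = sum(c.count for c in self.clusters)
        return [c.count / total for c in self.clusters]


def select_key_frames(memory, features, k):
    """From the k largest clusters, return the frame id nearest each centroid."""
    top = sorted(memory.clusters, key=lambda c: c.count, reverse=True)[:k]
    return sorted(
        min(c.frame_ids, key=lambda f: sq_dist(features[f], c.centroid))
        for c in top
    )


# Toy stream: five near-identical frames, two identical frames, one outlier.
features = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0],
    [5.0, 5.0], [5.0, 5.0], [10.0, 0.0],
]
memory = ToyContextMemory(capacity=3)
for frame_id, feature in enumerate(features):
    memory.add(frame_id, feature)

print(memory.density())
print(select_key_frames(memory, features, k=2))
```

In this sketch the context memory never grows beyond `capacity` entries regardless of stream length, while the key-frame selection is where a real system would fetch higher-resolution features for the chosen frames.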

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin • 2025

Related benchmarks

Task                            Dataset                      Result                       Rank
Video Question Answering        EgoSchema                    Accuracy 68.2                161
Streaming Video Understanding   StreamingBench               Overall 26                   158
Video Question Answering        MLVU                         Accuracy 66.3                143
Long Video Understanding        LVBench                      Accuracy 42                  133
Video Question Answering        LVBench                      Accuracy 42                  108
Real-Time Visual Understanding  StreamingBench               Overall Score 23.23          96
Video Understanding             Video-MME without subtitles  Overall Score 61.2           89
Long Video Understanding        MLVU (dev)                   --                           63
Online Video Understanding      OVO-Bench                    Backward Tracing Avg. 27.38  48
Video Multimodal Understanding  Video-MME                    Score 61.2                   33
Showing 10 of 17 rows
