
StreamForest: Efficient Online Video Understanding with Persistent Event Memory

About

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
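
The abstract names three signals that guide memory merging (temporal distance, content similarity, and merge frequency) but does not give the penalty functions themselves. The sketch below is only an illustration of how such a budgeted merge loop might look. The penalty form, the weights, and every name in it (EventNode, merge_penalty, insert_frame, max_nodes) are assumptions made for this example and are not taken from the paper. It also keeps only the root of each event tree, whereas the paper describes full tree structures.

```python
import math
from dataclasses import dataclass


@dataclass
class EventNode:
    """Hypothetical root of one event tree (not the paper's data structure)."""
    feature: list[float]  # mean feature of the frames covered by this node
    start: float          # timestamp of the first covered frame
    end: float            # timestamp of the last covered frame
    frames: int = 1       # number of frames covered
    merges: int = 0       # number of merges that produced this node


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_penalty(left, right, now, w_time=1.0, w_sim=1.0, w_freq=1.0):
    """Assumed penalty: lower means the adjacent pair is a better merge candidate."""
    time_term = 1.0 / (1.0 + (now - right.end))  # recent pairs cost more to merge
    sim_term = 1.0 - cosine(left.feature, right.feature)  # dissimilar pairs cost more
    freq_term = left.merges + right.merges  # heavily merged nodes cost more
    return w_time * time_term + w_sim * sim_term + w_freq * freq_term


def merge(left, right):
    """Combine two adjacent nodes into one, averaging features by frame count."""
    total = left.frames + right.frames
    feature = [
        (x * left.frames + y * right.frames) / total
        for x, y in zip(left.feature, right.feature)
    ]
    return EventNode(feature, left.start, right.end, total,
                     left.merges + right.merges + 1)


def insert_frame(memory, feature, t, max_nodes):
    """Append a new frame, then merge lowest-penalty neighbours until within budget."""
    memory.append(EventNode(list(feature), t, t))
    while len(memory) > max_nodes:
        i = min(range(len(memory) - 1),
                key=lambda k: merge_penalty(memory[k], memory[k + 1], now=t))
        memory[i:i + 2] = [merge(memory[i], memory[i + 1])]
    return memory


# Toy stream: three frames of one scene followed by three frames of another.
memory = []
stream = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
for t, feat in enumerate(stream):
    insert_frame(memory, feat, float(t), max_nodes=4)

for node in memory:
    print(node.start, node.end, node.frames, node.merges)
```

With this toy stream the memory never exceeds four nodes, and the similarity term keeps frames from the two scenes from being merged together. Different weights would trade recency against content coherence in other ways.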

Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, Limin Wang • 2025

Related benchmarks

Task                            Dataset                      Result                       Rank
Video Understanding             MVBench                      Accuracy 70.2                425
Streaming Video Understanding   StreamingBench               Overall 77.3                 158
Long Video Understanding        MLVU                         --                           154
Real-Time Visual Understanding  StreamingBench               Overall Score 77.26          96
Video Understanding             Video-MME without subtitles  --                           89
Online Video Understanding      OVO-Bench                    Backward Tracing Avg. 52.02  48
Multi-modal Video Evaluation    VideoMME                     --                           42
Long Video Understanding        VideoMME                     Accuracy 61.4                40
Streaming Video Understanding   OVOBench Realtime            Average Score 61.2           32
Streaming Video Understanding   OVO-Bench                    OCR 68.5                     32
Showing 10 of 26 rows
