StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

About

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu • 2026
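
The abstract names two mechanisms: a memory that keeps historical context within a fixed budget, and a trigger that reads the model's hidden state to decide when to respond. The sketch below is a toy illustration of those two ideas only, not the authors' implementation. The class `BoundedMemory`, the function `should_respond`, the per-chunk evidence `score`, the linear trigger weights, and the 0.5 threshold are all hypothetical names and choices made here for illustration; the abstract does not specify how evidence is scored or how the trigger is computed.

```python
import heapq
import math
from collections import deque


class BoundedMemory:
    """Toy long-short term memory with a fixed total budget (illustrative only)."""

    def __init__(self, short_size, long_size):
        self.short = deque()   # most recent chunks, in arrival order
        self.long = []         # min-heap of (score, step, chunk)
        self.short_size = short_size
        self.long_size = long_size
        self.step = 0          # arrival counter, also breaks score ties

    def add(self, chunk, score):
        """Insert a chunk; `score` is a hypothetical evidence score."""
        self.short.append((score, self.step, chunk))
        self.step += 1
        if len(self.short) > self.short_size:
            # The oldest short-term chunk moves to long-term memory.
            heapq.heappush(self.long, self.short.popleft())
            if len(self.long) > self.long_size:
                # Over budget: drop the lowest-scoring long-term chunk.
                heapq.heappop(self.long)

    def context(self):
        """Return all retained chunks in chronological order."""
        items = list(self.long) + list(self.short)
        return [chunk for _, _, chunk in sorted(items, key=lambda t: t[1])]


def should_respond(hidden, weights, bias=0.0, threshold=0.5):
    """Assumed trigger: a logistic score over a hidden-state vector."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob >= threshold


mem = BoundedMemory(short_size=2, long_size=2)
for i, s in enumerate([0.9, 0.1, 0.8, 0.2, 0.5, 0.3]):
    mem.add(f"c{i}", s)

print(mem.context())                           # never more than 4 chunks
print(should_respond([1.0, 2.0], [0.5, 0.5]))  # positive logit
```

In this toy version the memory never holds more than `short_size + long_size` chunks regardless of stream length, and the respond-or-stay-silent decision is a thresholded score rather than a generated silence token. Both properties mirror the abstract's description only at a high level.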

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Online Visual-Only Question Answering | OVO-Bench | OCR 95.3 | 13 |
| Video Understanding | VideoMME Overall | Accuracy 73.5 | 13 |
| Video Understanding | Video-MME Long | Accuracy 63.4 | 12 |
| Visual-Only Question Answering | StreamingBench Visual-Only QA 1.0 (test) | OP 92.6 | 11 |
| Audio-Visual Question Answering | Video Holmes 32 frames | SR 64.4 | 8 |
| Audio-Visual Question Answering | Daily-Omni 1 FPS | Metric 30 70.9 | 8 |
| Omni-video Question Answering | SOVBench-O | AV Context: Real-Time Accuracy 86.9 | 8 |
| Audio-Visual Question Answering | StreamingBench Audio-Visual QA 1.0 (test) | Error Rate (ER) 63.6 | 6 |
| Response Triggering | SOVBench-T | Precision (T=1) 86.1 | 2 |