StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

About

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.
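The abstract names two mechanisms, a fixed-budget long-short term memory and a hidden-state-driven response trigger, without giving their details. The Python sketch below is only an illustration of the general idea under stated assumptions, not the StreamOV implementation: the per-chunk evidence score, the lowest-score eviction rule, and the linear-plus-sigmoid trigger are stand-ins, and every name (`Chunk`, `BoundedMemory`, `should_respond`) is hypothetical.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    t: float               # timestamp of the audio-visual chunk, in seconds
    feature: List[float]   # stand-in for a fused audio-visual embedding
    score: float           # hypothetical evidence score (higher = more informative)


class BoundedMemory:
    """Toy long-short term memory whose total size never exceeds a fixed budget."""

    def __init__(self, short_size: int, long_size: int) -> None:
        self.short_size = short_size
        self.long_size = long_size
        self.short = deque()   # most recent chunks, kept in arrival order
        self.long = []         # older chunks retained for their evidence score

    def add(self, chunk: Chunk) -> None:
        self.short.append(chunk)
        if len(self.short) > self.short_size:
            # The oldest recent chunk moves to long-term memory.
            self.long.append(self.short.popleft())
            if len(self.long) > self.long_size:
                # Assumed eviction rule: drop the lowest-scoring long-term chunk.
                self.long.remove(min(self.long, key=lambda c: c.score))

    def context(self) -> List[Chunk]:
        # Long-term evidence first, then the recent window; both are time-ordered.
        return self.long + list(self.short)


def should_respond(hidden: List[float], weights: List[float],
                   bias: float = 0.0, threshold: float = 0.5) -> bool:
    """Assumed trigger: a linear probe on a hidden state, squashed by a sigmoid."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit)) >= threshold


memory = BoundedMemory(short_size=2, long_size=2)
for t, s in enumerate([0.1, 0.9, 0.5, 0.3, 0.8, 0.2]):
    memory.add(Chunk(t=float(t), feature=[0.0], score=s))

kept_timestamps = [c.t for c in memory.context()]
```

In this toy run the memory never holds more than four chunks regardless of stream length, and the decision to respond comes from a scalar read off the hidden state rather than from generating a silence token or calling a separate router.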

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu • 2026

Related benchmarks

Task	Dataset	Result
Online Visual-Only Question Answering	OVO-Bench	OCR 95.3	13
Video Understanding	VideoMME Overall	Accuracy 73.5	13
Video Understanding	Video-MME Long	Accuracy 63.4	12
Visual-Only Question Answering	StreamingBench Visual-Only QA 1.0 (test)	OP 92.6	11
Audio-Visual Question Answering	Video Holmes 32 frames	SR 64.4	8
Audio-Visual Question Answering	Daily-Omni 1 FPS	Metric 3070.9	8
Omni-video Question Answering	SOVBench-O	AV Context: Real-Time Accuracy 86.9	8
Audio-Visual Question Answering	StreamingBench Audio-Visual QA 1.0 (test)	Error Rate (ER) 63.6	6
Response Triggering	SOVBench-T	Precision (T=1) 86.1	2
