
Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

About

Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress ($\boldsymbol{\rho}$) and confidence ($\boldsymbol{c}$) metrics. This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Extensive experiments demonstrate the effectiveness of ATDM and HPSI; for example, Thinking-QwenVL improves on the previous state-of-the-art accuracy on the StreamingBench benchmark, from 67.63% to 71.60%.
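To make the evidence-aligned timing idea concrete, here is a minimal sketch of a streaming decision rule. It is not the authors' ATDM implementation: the `ClipSignal` structure, the two thresholds, and the rule "respond once both progress and confidence clear a threshold" are all assumptions chosen for illustration. The abstract only states that progress ($\rho$) and confidence ($c$) are observable and that the response time $t_r$ should match $t^\star$.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ClipSignal:
    """Per-clip signals a controller might expose (hypothetical structure)."""
    timestamp: float   # end time of the clip, in seconds
    progress: float    # rho in [0, 1]: share of the needed evidence seen so far
    confidence: float  # c in [0, 1]: confidence in the current answer


def first_response_time(
    signals: Iterable[ClipSignal],
    progress_threshold: float = 0.8,    # assumed value, not from the paper
    confidence_threshold: float = 0.7,  # assumed value, not from the paper
) -> Optional[float]:
    """Return t_r: the timestamp of the first clip that clears both thresholds."""
    for s in signals:  # clips are consumed in causal order, with no look-ahead
        if s.progress >= progress_threshold and s.confidence >= confidence_threshold:
            return s.timestamp
    return None  # evidence never became sufficient, so no response is emitted


def timing_error(t_r: Optional[float], t_star: float) -> Optional[float]:
    """Signed gap t_r - t_star; negative means the response came too early."""
    return None if t_r is None else t_r - t_star


# Toy stream: evidence accumulates and becomes sufficient at the third clip.
stream = [
    ClipSignal(2.0, 0.2, 0.3),
    ClipSignal(4.0, 0.6, 0.5),
    ClipSignal(6.0, 0.9, 0.8),
    ClipSignal(8.0, 1.0, 0.9),
]
t_r = first_response_time(stream)
print(t_r, timing_error(t_r, t_star=6.0))
```

In this toy stream the rule fires at the third clip, so the response time coincides with the assumed $t^\star$ of 6.0 seconds. A learned controller would presumably produce $\rho$ and $c$ from the model itself rather than receive them as inputs.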

Kecheng Zhang, Zongxin Yang, Mingfei Han, Haihong Hao, Yunzhi Zhuge, Changlin Li, Junhan Zhao, Zhihui Li, Xiaojun Chang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Real-Time Visual Understanding | StreamingBench | Overall Score | 71.6 | 134 |
| Long Video Understanding | Video-MME Overall | Accuracy | 67.7 | 53 |
| Long Video Understanding | MLVU 3-120 min | Accuracy | 68.3 | 36 |
| Long Video Understanding | VideoMME Long split, 30-60 min | Accuracy | 56.4 | 27 |
| Online Video Understanding | OVOBench 1.0 (test) | Real-Time Perception | 64.7 | 27 |
| Real-time Streaming | OVO-Bench | RTVP | 64.7 | 17 |
| Real-time Streaming | StreamingBench | RTVU | 71.6 | 15 |
| Long Video Understanding | LVBench 30~120 min | Accuracy | 43.6 | 9 |
| Long Video Understanding | LongVideoBench (8 sec~60 min) | Accuracy | 62 | 7 |
