EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

About

Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observe that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ fewer than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.

Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao, Junxi Wang, Xuyang Liu, Linfeng Zhang • 2026
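
To make the idea of a frame-level protocol that penalizes unnecessary responses more concrete, here is a minimal toy sketch. It is not the RealStreamEval implementation: the `Query` fields, the `policy` signature, the exact-match check, and the linear `penalty` per extra reply are all assumptions made for illustration, since the abstract does not specify the scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Query:
    """Hypothetical container for one streaming question (illustrative only)."""
    ask_frame: int     # frame index at which the user asks the question
    answer_frame: int  # first frame at which the answer becomes visible
    answer: str        # reference answer, compared by exact match here


# A policy sees the frame index and whether a question is pending.
# It returns a reply string, or None to stay silent on this frame.
Policy = Callable[[int, bool], Optional[str]]


def score_stream(policy: Policy, num_frames: int, query: Query,
                 penalty: float = 0.1) -> float:
    """Feed frames one at a time and score the timing of replies.

    Assumed toy rule: 1.0 for the first correct reply given at or after
    `answer_frame`, minus `penalty` for every other reply, floored at 0.
    """
    answered = False
    unnecessary = 0
    for t in range(num_frames):
        asked = t >= query.ask_frame
        reply = policy(t, asked)
        if reply is None:
            continue  # staying silent costs nothing
        on_time = t >= query.answer_frame
        if on_time and reply == query.answer and not answered:
            answered = True
        else:
            unnecessary += 1  # early, wrong, or repeated reply
    return max(0.0, (1.0 if answered else 0.0) - penalty * unnecessary)


def patient(t: int, asked: bool) -> Optional[str]:
    # Speaks exactly once, when the evidence appears (frame 6 in the demo).
    return "cat" if asked and t == 6 else None


def chatty(t: int, asked: bool) -> Optional[str]:
    # Replies on every frame once the question has been asked.
    return "cat" if asked else None


def silent(t: int, asked: bool) -> Optional[str]:
    # Never replies.
    return None


if __name__ == "__main__":
    q = Query(ask_frame=2, answer_frame=6, answer="cat")
    for name, p in [("patient", patient), ("chatty", chatty), ("silent", silent)]:
        print(name, round(score_stream(p, 10, q), 2))
```

Under these assumptions, a policy that gives the right answer but speaks on every frame scores lower than one that waits and answers once. This is the responsiveness-versus-verbosity trade-off the abstract describes.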

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | VideoMME | Score (Overall): 63.5 | 357 |
| Video Understanding | EgoSchema | -- | 185 |
| Video Understanding | LongVideoBench | -- | 123 |
| Video Understanding | MLVU | Accuracy: 66 | 114 |
| General Video Understanding | LVBench | Accuracy: 42.2 | 34 |
| Streaming Video Understanding | OVO-Bench (RealStreamEval protocol) | OCR: 82.9 | 17 |
| General Video Understanding | Combined (VideoMME, LVBench, LongVideoBench, EgoSchema, MLVU) | Average Score: 57.8 | 11 |