
Streaming Video Instruction Tuning

About

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou • 2025

Related benchmarks

Task | Dataset | Result | Rank
Video Understanding | MVBench | Accuracy 72.3 | 247
Long-form Video Understanding | LongVideoBench | Accuracy 59.2 | 82
Temporal Video Understanding | TempCompass | Average Score 71.8 | 52
Multi-modal Video Understanding | MVBench | -- | 39
Multi-modal Video Evaluation | Video-MME | Accuracy 68.7 | 38
Online Video Understanding | OVO-Bench | OCR 79.19 | 30
Multi-modal Video Evaluation | VideoMME | -- | 30
Streaming Video Understanding | OVOBench Realtime | Average Score 72.2 | 17
Streaming Video Understanding | OVO-Bench 1.0 (test) | OCR 0.8255 | 13
Online Video Understanding | OVO Backward | Score 46.1 | 13
Showing 10 of 13 rows
