Streaming Video Instruction Tuning

About

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MVBench | Accuracy 72.3 | 563 |
| Temporal Video Understanding | TempCompass | Accuracy 71.8 | 141 |
| Long-form Video Understanding | LongVideoBench | Accuracy 59.2 | 135 |
| Multi-modal Video Understanding | MVBench | Accuracy 72.3 | 83 |
| Streaming Video Understanding | OVO-Bench | Real-Time Visual Perception Avg. 66 | 56 |
| Online Video Understanding | OVO-Bench | Backward Tracing Avg. 49.18 | 48 |
| Multi-modal Video Evaluation | VideoMME | -- | 42 |
| Real-time Visual Perception | OVO-Bench | OCR 78.52 | 41 |
| Backward Tracing | OVO-Bench | EPM 51.18 | 41 |
| Multi-modal Video Evaluation | Video-MME | Accuracy 68.7 | 38 |
Showing 10 of 22 rows
