Streaming Video Instruction Tuning

About

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After end-to-end training on this dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, taking a step toward unified, intelligent understanding of continuous video streams.

Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 72.3	635
Temporal Video Understanding	TempCompass	Accuracy 71.8	160
General Video Understanding	Video-MME	Accuracy 67.9	139
Long-form Video Understanding	LongVideoBench	Accuracy 59.2	135
Multi-modal Video Understanding	MVBench	--	84
Streaming Video Understanding	OVO-Bench	Real-Time Visual Perception Avg. 67.44	75
Multi-modal Video Evaluation	Video-MME	Accuracy 68.7	57
Multi-modal Video Evaluation	VideoMME	--	50
Online Video Understanding	OVO-Bench	Backward Tracing Avg. 49.18	48
Real-time Visual Perception	OVO-Bench	OCR 78.52	41

Showing 10 of 24 rows
