Streaming Video Instruction Tuning
About
We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop this versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After end-to-end training on this dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, taking a step toward unified, intelligent video understanding in continuous video streams.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | Accuracy: 72.3 | 563 |
| Temporal Video Understanding | TempCompass | Accuracy: 71.8 | 141 |
| Long-form Video Understanding | LongVideoBench | Accuracy: 59.2 | 135 |
| Multi-modal Video Understanding | MVBench | Accuracy: 72.3 | 83 |
| Streaming Video Understanding | OVO-Bench | Real-Time Visual Perception Avg.: 66 | 56 |
| Online Video Understanding | OVO-Bench | Backward Tracing Avg.: 49.18 | 48 |
| Multi-modal Video Evaluation | VideoMME | -- | 42 |
| Real-time Visual Perception | OVO-Bench | OCR: 78.52 | 41 |
| Backward Tracing | OVO-Bench | EPM: 51.18 | 41 |
| Multi-modal Video Evaluation | Video-MME | Accuracy: 68.7 | 38 |