StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
About
We present StreamBridge, a simple yet effective framework that transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) the lack of a proactive response mechanism. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, which supports long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be integrated into existing Video-LLMs to enable continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. At the same time, it achieves competitive or superior performance on standard video understanding benchmarks.
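The two components named above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how compression or activation is computed, so the sketch assumes that older rounds are mean-pooled to a single token, oldest first, until the buffer fits a token budget, and that the activation model emits a scalar score compared against a threshold. The names `Round`, `MemoryBuffer`, `mean_pool`, `should_respond`, and the parameters `max_tokens` and `threshold` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class Round:
    """One interaction round: visual tokens plus the text exchanged (hypothetical)."""
    frame_tokens: List[Vector]
    text: str = ""


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors into one vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


@dataclass
class MemoryBuffer:
    """Stores rounds and compresses the oldest ones first (assumed strategy)."""
    max_tokens: int
    rounds: List[Round] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(len(r.frame_tokens) for r in self.rounds)

    def append(self, rnd: Round) -> None:
        self.rounds.append(rnd)
        self.compress()

    def compress(self) -> None:
        # Walk from the oldest round towards the newest, pooling each round
        # to a single token until the budget is met. The newest round is
        # always kept at full resolution, so the budget can still be
        # exceeded when every older round is already pooled.
        for rnd in self.rounds[:-1]:
            if self.total_tokens() <= self.max_tokens:
                break
            if len(rnd.frame_tokens) > 1:
                rnd.frame_tokens = [mean_pool(rnd.frame_tokens)]


def should_respond(activation_score: float, threshold: float = 0.5) -> bool:
    """Stand-in for the decoupled activation model: respond when the
    (assumed) scalar score reaches the threshold."""
    return activation_score >= threshold
```

In a streaming loop under these assumptions, each incoming round would be appended to the buffer, and the main Video-LLM would be invoked on the buffered context only when `should_respond` returns true for the current frame.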
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Streaming Video Understanding | StreamingBench | Overall | 57.12 | 158 |
| Real-Time Visual Understanding | StreamingBench | Overall Score | 73.79 | 96 |
| Long Video Understanding | VideoMME | Accuracy | 64.4 | 40 |
| Streaming Video Understanding | OVOBench | Accuracy (Proactive Forwarding) | 48.4 | 17 |
| Readiness-aware streaming understanding | ProReady-QA | SSR Accuracy | 72.2 | 14 |
| Dense Video Captioning | E.T.Bench | -- | -- | 14 |
| Online Activation Accuracy | ET-Bench | TVG F1 | 35.7 | 10 |
| Step Localization and Captioning | ET-Bench | F1 Score | 22.6 | 4 |
| Temporal Video Grounding | ET-Bench | F1-score | 34.3 | 4 |