StreamingTOM: Streaming Token Compression for Efficient Video Understanding
About
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate the post-LLM kv-cache, leaving the costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens and ensuring predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate that our method achieves a $15.7\times$ kv-cache compression ratio; compared to the prior SOTA (LiveVLM), it delivers $1.2\times$ lower peak memory and $2\times$ faster TTFT. StreamingTOM achieves state-of-the-art accuracy among training-free methods, with an average of $63.8\%$ on offline benchmarks and $55.8\%$ accuracy with a $3.7$ score on RVS. These results demonstrate that real-time streaming video understanding with bounded active memory is achievable without model retraining.
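The two stages described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact saliency score, change metric, and quantization granularity are assumptions (here, per-token L2 norm, mean absolute frame difference, and per-token uniform 4-bit quantization).

```python
import numpy as np

def select_tokens(prev_frame: np.ndarray, curr_frame: np.ndarray, budget: int) -> np.ndarray:
    """Causal Temporal Reduction (sketch): keep a fixed per-frame budget of
    tokens, scored by adjacent-frame change plus a saliency proxy.
    Frames are (num_tokens, dim) arrays of visual tokens."""
    change = np.abs(curr_frame - prev_frame).mean(axis=1)  # per-token temporal change
    saliency = np.linalg.norm(curr_frame, axis=1)          # per-token saliency proxy (assumed)
    score = change + saliency
    keep = np.argsort(score)[-budget:]                     # indices of top-`budget` tokens
    return curr_frame[keep]

def quantize_4bit(tokens: np.ndarray):
    """Online Quantized Memory, store side (sketch): uniform 4-bit
    quantization with a per-token scale and zero point."""
    lo = tokens.min(axis=1, keepdims=True)
    hi = tokens.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8                        # 16 levels -> step = range / 15
    q = np.round((tokens - lo) / scale).astype(np.uint8)   # codes in [0, 15]
    return q, scale, lo

def dequantize_4bit(q: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Retrieve side (sketch): map 4-bit codes back to float tokens on demand."""
    return q.astype(np.float32) * scale + lo
```

In this sketch, only the `budget` selected tokens per frame reach prefill, and the memory holds 4-bit codes plus small per-token scale factors, so the active kv-cache stays bounded as the stream grows.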
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | Video-MME (without subtitles) | Overall Score | 59.9 | 89 |
| Long Video Understanding | MLVU (dev) | -- | -- | 63 |
| Streaming Video Understanding | RVS-Movie | Accuracy | 53.2 | 22 |
| Streaming Video Understanding | RVS-Ego | Accuracy | 58.3 | 19 |
| Long Video Understanding | EgoSchema (dev) | Accuracy | 63.7 | 11 |