StreamingTOM: Streaming Token Compression for Efficient Video Understanding
About
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate the post-LLM kv-cache, leaving the costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens and ensuring predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate that our method achieves a $15.7\times$ kv-cache compression ratio; compared to the prior SOTA (LiveVLM), it delivers $1.2\times$ lower peak memory and $2\times$ faster TTFT. StreamingTOM achieves state-of-the-art accuracy among training-free methods, with an average of $63.8\%$ on offline benchmarks and $55.8\%$ accuracy with a $3.7$ score on RVS. These results demonstrate that real-time streaming video understanding with bounded active memory is achievable without model retraining.
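The two stages described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact saliency score, change metric, and quantization granularity are assumptions (here, per-token L2 norm, mean absolute frame difference, and per-token uniform 4-bit quantization).

```python
import numpy as np

def select_tokens(prev_frame: np.ndarray, curr_frame: np.ndarray, budget: int) -> np.ndarray:
    """Causal Temporal Reduction (sketch): keep a fixed per-frame budget of
    tokens, scored by adjacent-frame change plus a saliency proxy.
    Frames are (num_tokens, dim) arrays of visual tokens."""
    change = np.abs(curr_frame - prev_frame).mean(axis=1)  # per-token temporal change
    saliency = np.linalg.norm(curr_frame, axis=1)          # per-token saliency proxy (assumed)
    score = change + saliency
    keep = np.argsort(score)[-budget:]                     # indices of top-`budget` tokens
    return curr_frame[keep]

def quantize_4bit(tokens: np.ndarray):
    """Online Quantized Memory, store side (sketch): uniform 4-bit
    quantization with a per-token scale and zero point."""
    lo = tokens.min(axis=1, keepdims=True)
    hi = tokens.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8                        # 16 levels -> step = range / 15
    q = np.round((tokens - lo) / scale).astype(np.uint8)   # codes in [0, 15]
    return q, scale, lo

def dequantize_4bit(q: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Retrieve side (sketch): map 4-bit codes back to float tokens on demand."""
    return q.astype(np.float32) * scale + lo
```

In this sketch, only the `budget` selected tokens per frame reach prefill, and the memory holds 4-bit codes plus small per-token scale factors, so the active kv-cache stays bounded as the stream grows.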
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | Video-MME (without subtitles) | Overall Score | 59.9 | 89 |
| Long Video Understanding | MLVU (dev) | -- | -- | 63 |
| Streaming Video Understanding | RVS-Movie | Accuracy | 53.2 | 22 |
| Streaming Video Understanding | RVS-Ego | Accuracy | 58.3 | 19 |
| Long Video Understanding | EgoSchema (dev) | Accuracy | 63.7 | 11 |