InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

About

Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
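The threshold-triggered compression loop described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the abstract does not define the TaR or VaN scores, so the code assumes (hypothetically) that temporal redundancy is the cosine similarity between a token's key and the key at the same spatial position in the most recent frame, and that semantic significance is the L2 norm of the value vector. The names `CappedKVCache`, `threshold`, `budget`, and `redundancy_cutoff` are invented for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = l2_norm(a), l2_norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


class CappedKVCache:
    """Toy streaming cache that compresses whenever it reaches `threshold`.

    Each entry is a tuple (frame_idx, position, key, value).
    Both scoring rules below are ASSUMPTIONS, not the paper's definitions.
    """

    def __init__(self, threshold, budget, redundancy_cutoff=0.95):
        if not 0 < budget < threshold:
            raise ValueError("need 0 < budget < threshold")
        self.threshold = threshold
        self.budget = budget
        self.redundancy_cutoff = redundancy_cutoff
        self.entries = []

    def append_frame(self, frame_idx, keys, values):
        """Add one frame's tokens, then compress if the threshold is reached."""
        for pos, (k, v) in enumerate(zip(keys, values)):
            self.entries.append((frame_idx, pos, k, v))
        if len(self.entries) >= self.threshold:
            self._compress()

    def _compress(self):
        latest = max(e[0] for e in self.entries)
        latest_keys = {e[1]: e[2] for e in self.entries if e[0] == latest}

        # Step (i), assumed TaR stand-in: drop older tokens whose key is
        # nearly identical to the latest frame's key at the same position.
        kept = []
        for frame, pos, k, v in self.entries:
            redundant = (
                frame != latest
                and pos in latest_keys
                and cosine(k, latest_keys[pos]) >= self.redundancy_cutoff
            )
            if not redundant:
                kept.append((frame, pos, k, v))

        # Step (ii), assumed VaN stand-in: if still over budget, keep the
        # tokens with the largest value norms, preserving temporal order.
        if len(kept) > self.budget:
            ranked = sorted(
                range(len(kept)),
                key=lambda i: l2_norm(kept[i][3]),
                reverse=True,
            )
            keep_idx = sorted(ranked[: self.budget])
            kept = [kept[i] for i in keep_idx]

        self.entries = kept
```

In this sketch the cache never holds more than `threshold - 1` tokens plus one frame's worth before compression, and at most `budget` tokens right after it, so the peak size does not depend on how long the stream runs. Feeding the cache a static scene (identical keys in every frame) leaves only the latest frame's tokens after a compression pass, while a changing scene falls through to the value-norm ranking.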

Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang • 2025

Related benchmarks

Task	Dataset	Result
Visual Question Answering	GQA	Accuracy 58.5	524
Optical Character Recognition	OCRBench	--	433
Visual Question Answering	RealworldQA	Accuracy 54.1	259
Streaming Video Understanding	StreamingBench	--	259
Video Question Answering	VideoMME	Accuracy 61.1	251
Video Question Answering	EgoSchema (Full)	Accuracy 61.8	241
Long Video Understanding	MLVU	--	205
Video Question Answering	EgoSchema	Accuracy 65.8	161
Long Video Understanding	VideoMME	Accuracy 62.8	89
Visual Question Answering	MMVP	Accuracy 35.3	82

Showing 10 of 33 rows
