KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

About

Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$-$4\times$ and FlashAttention decoding latency by approximately $2\times$, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song • 2025
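
The abstract describes a score-then-evict scheme: each cached KV pair gets an importance score derived from context reconstruction, and low-scoring pairs are dropped. The sketch below illustrates only that general shape and is not the authors' implementation. It assumes, for illustration, that a pair's importance is the maximum attention weight it receives from any token of a reconstruction pass; the function names `importance_from_attention` and `evict_by_importance`, the `keep_ratio` parameter, and the toy attention matrix are all hypothetical.

```python
from typing import List


def importance_from_attention(attn: List[List[float]]) -> List[float]:
    """Score each cached KV pair by the largest attention weight it receives.

    attn[q][k] is the attention weight that reconstruction token q places on
    cached pair k. Using the maximum as the score is an assumption made here.
    """
    num_pairs = len(attn[0])
    return [max(row[k] for row in attn) for k in range(num_pairs)]


def evict_by_importance(scores: List[float], keep_ratio: float) -> List[int]:
    """Return the sorted positions of the KV pairs to keep."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank positions by score, highest first; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore positional order so the kept cache stays in sequence order.
    return sorted(ranked[:n_keep])


# Toy example: 3 reconstruction tokens attending over 4 cached KV pairs.
attn = [
    [0.70, 0.05, 0.20, 0.05],
    [0.10, 0.05, 0.80, 0.05],
    [0.30, 0.10, 0.10, 0.50],
]
scores = importance_from_attention(attn)
kept = evict_by_importance(scores, keep_ratio=0.5)
print(kept)
```

Because the scores here depend only on the context and not on any downstream question, the same set of kept positions can be reused for every later query, which is the sense in which the abstract calls the method query-agnostic.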

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	GSM8K	GSM8K Accuracy (%)	95.2	220
Long-context Question Answering	Locomo	F1 (Multi Hop)	25.6	174
Long-term Conversation Question Answering	REALTALK	Multi-hop Score	38.8	84
Long-context Question Answering	LongMemEval LongConvQA	SH Score	73.6	84
Mathematical Reasoning	AIME24 Math	Performance (%)	46.7	60
Code Generation	MBPP Code	Performance (%)	78.2	60
Long-context Conversational Question Answering	Locomo	Multi-Hop F1	37.2	59
Multiple-Choice Questions	Four-domain MCQ (test)	Accuracy	56.3	43
Multiple-choice Question Answering	MMLU-Pro Law	Accuracy	30.8	40
Multiple-choice Question Answering	MMLU-Pro Phys.	Accuracy (%)	70.4	40

Showing 10 of 15 rows
