KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
About
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$-$4\times$ and FlashAttention decoding latency by approximately $2\times$, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
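The score-then-evict idea described above can be illustrated with a minimal sketch. The abstract does not spell out the scoring rule, so this example assumes that a cached pair's importance is the maximum attention weight it receives while the model reconstructs the context. The function names, the toy attention matrix, and the `budget_ratio` parameter are hypothetical and are not taken from the KVzip codebase.

```python
from typing import List


def importance_scores(attn: List[List[float]]) -> List[float]:
    """Score each cached KV pair from a reconstruction pass.

    attn[q][k] is the attention weight that reconstruction query q
    places on cached KV pair k. Assumption: importance is the maximum
    weight a pair receives over all reconstruction queries.
    """
    num_kv = len(attn[0])
    return [max(row[k] for row in attn) for k in range(num_kv)]


def select_kept_indices(scores: List[float], budget_ratio: float) -> List[int]:
    """Return indices of the KV pairs to keep, in original order.

    budget_ratio is the fraction of the cache to retain (hypothetical
    parameter name); lower-scoring pairs are evicted.
    """
    keep = max(1, int(len(scores) * budget_ratio))
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:keep])


# Toy example: 3 reconstruction queries attending over 4 cached KV pairs.
attn = [
    [0.70, 0.05, 0.20, 0.05],
    [0.10, 0.05, 0.80, 0.05],
    [0.60, 0.10, 0.10, 0.20],
]
scores = importance_scores(attn)
kept = select_kept_indices(scores, budget_ratio=0.5)
print(scores)  # per-pair importance
print(kept)    # indices retained under a 50% budget
```

Because the scores come from reconstructing the context rather than from any particular question, the same kept set can in principle be reused across different queries, which is the query-agnostic property the abstract emphasizes.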
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | GSM8K Accuracy (%) | 95.2 | 204 |
| Long-context Question Answering | Locomo | F1 (Multi Hop) | 25.6 | 171 |
| Long-term Conversation Question Answering | REALTALK | Multi-hop Score | 38.8 | 84 |
| Long-context Question Answering | LongMemEval LongConvQA | SH Score | 73.6 | 84 |
| Mathematical Reasoning | AIME24 Math | Performance (%) | 46.7 | 60 |
| Code Generation | MBPP Code | Performance (%) | 78.2 | 60 |
| Long-context Conversational Question Answering | Locomo | Multi-Hop F1 | 37.2 | 59 |
| Multiple-choice Question Answering | MMLU-Pro Law | Accuracy | 30.8 | 40 |
| Multiple-choice Question Answering | MMLU-Pro Phys. | Accuracy (%) | 70.4 | 40 |
| Multiple-choice Question Answering | MMLU-Pro Chem. | Accuracy | 70.8 | 40 |