CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective

About

Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large KV cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. We demonstrate that our algorithm is a universal, plug-and-play enhancement that incurs negligible computational overhead. When integrated with three state-of-the-art cache eviction methods on three distinct LLMs, our algorithm reduces the compression loss by more than half on average across 29 datasets from the Ruler and LongBench benchmarks. Further perturbation analysis, at both the head and layer levels, confirms the principles underlying its effectiveness. This work offers a new, formally grounded perspective on cache eviction, opening promising avenues for future research. The code is publicly available at https://github.com/FFY0/DefensiveKV.
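
The abstract does not spell out the selection rule, so the following is only an illustrative sketch of the general idea it describes: scoring each cache entry by its attention weight together with its value state projected through a pretrained output matrix, rather than by attention weight alone. Everything here is an assumption made for illustration: the names `select_entries` and `output_perturbation`, the single-head single-query setup, the L1 norm, the random data, and the simplification that the remaining attention weights are not renormalized after eviction. It should not be read as the authors' algorithm.

```python
import numpy as np

def select_entries(attn, values, w_o, budget):
    """Keep the `budget` entries with the largest a_i * ||v_i W_O||_1 (hypothetical score)."""
    projected = values @ w_o                        # (n, d_model): value states after output projection
    scores = attn * np.abs(projected).sum(axis=1)   # (n,): attention weight times projected L1 norm
    keep = np.sort(np.argsort(scores)[-budget:])    # indices of the top-`budget` scores
    return keep, scores

def output_perturbation(attn, values, w_o, keep):
    """L1 distance between the full output and the output from kept entries only.

    Assumption: attention weights of kept entries are NOT renormalized.
    """
    full = (attn[:, None] * values).sum(axis=0) @ w_o
    kept = (attn[keep, None] * values[keep]).sum(axis=0) @ w_o
    return np.abs(full - kept).sum()

# Synthetic single-head example (random data, for illustration only).
rng = np.random.default_rng(0)
n, d, d_model, budget = 64, 16, 32, 16
logits = rng.normal(size=n)
attn = np.exp(logits) / np.exp(logits).sum()        # softmax attention weights
values = rng.normal(size=(n, d))
w_o = rng.normal(size=(d, d_model))

keep, scores = select_entries(attn, values, w_o, budget)
evicted = np.setdiff1d(np.arange(n), keep)
perturbation = output_perturbation(attn, values, w_o, keep)
bound = scores[evicted].sum()                       # triangle-inequality upper bound

# Baseline: keep the entries with the largest attention weights only.
attn_keep = np.sort(np.argsort(attn)[-budget:])
attn_evicted = np.setdiff1d(np.arange(n), attn_keep)
attn_bound = scores[attn_evicted].sum()

print(f"perturbation={perturbation:.4f}  bound={bound:.4f}  attention-only bound={attn_bound:.4f}")
```

Under these assumptions, the sum of the evicted scores upper-bounds the output perturbation by the triangle inequality, and keeping the top-scoring entries makes that bound no larger than the one obtained by keeping the top attention weights.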

Yuan Feng, Junlin Lv, Haoyu Guo, Yukun Cao, S Kevin Zhou, Xike Xie • 2025

Related benchmarks

Task	Dataset	Result
Long-context language modeling	LongBench	Average Score 48.59	328
Long-context language modeling	RULER	RULER Score 0.6917	204
Long-context evaluation	LongBench	Average Score 43.46	90
Long-context Language Understanding	InfiniteBench	En.Sum 23.34	88
Long-context Understanding	LongBench 1.0 (test)	NarrativeQA 32.06	84
Long-context evaluation	RULER	Average Accuracy Score 44.13	54
Long-context Understanding	LongBench	Average Score 48	38
Long-context language modeling evaluation	RULER	Single-key Accuracy 100	29
Long-context Understanding	RULER	Average Accuracy 89.29	27
Long-context Understanding	LongBench	Average Score 48	26

Showing 10 of 19 rows
