Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution

About

Memory consumption of the Key-Value (KV) cache is a major bottleneck for efficient large language model inference. While attention-score-based KV cache pruning shows promise, it faces critical practical limitations: attention scores from future tokens are unavailable during compression, and modern implementations like Flash Attention do not materialize the full attention matrix, making past scores inaccessible. To overcome these challenges, we introduce Expected Attention, a training-free compression method that estimates the importance of KV pairs by predicting how future queries will attend to them. Our approach leverages the distributional properties of LLM activations to compute expected attention scores in closed form for each KV pair. These scores enable principled ranking and pruning of KV pairs with minimal impact on the residual stream, achieving effective compression without performance degradation. Importantly, our method operates seamlessly across both prefilling and decoding phases, consistently outperforming state-of-the-art baselines in both scenarios. Finally, we release KVPress, a comprehensive library that enables researchers to implement and benchmark KV cache compression methods and already includes more than 20 techniques.
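To make the "closed form" idea concrete, here is a minimal sketch under an assumption the abstract does not state: that future queries follow a Gaussian distribution q ~ N(mu, Sigma). In that case the expected unnormalized attention weight for a key k is the Gaussian moment-generating function, E[exp(q·k/√d)] = exp(mu·k/√d + kᵀΣk/(2d)). The function names `expected_attention_scores` and `prune` are hypothetical and are not part of KVPress; the paper's actual scoring may differ, for example in normalization or in how values are taken into account.

```python
import math


def expected_attention_scores(keys, mu, sigma):
    """Score each key by E[exp(q.k / sqrt(d))], assuming q ~ N(mu, sigma).

    keys:  list of key vectors (each a list of d floats)
    mu:    assumed mean of future queries (list of d floats)
    sigma: assumed covariance of future queries (d x d nested list)
    """
    d = len(mu)
    scores = []
    for k in keys:
        # Mean of the scaled logit q.k / sqrt(d).
        mean_term = sum(m * x for m, x in zip(mu, k)) / math.sqrt(d)
        # Variance of the scaled logit is k^T Sigma k / d.
        quad = sum(k[i] * sigma[i][j] * k[j] for i in range(d) for j in range(d))
        # Gaussian MGF: E[exp(X)] = exp(mean + variance / 2).
        scores.append(math.exp(mean_term + quad / (2 * d)))
    return scores


def prune(keys, values, scores, compression_ratio):
    """Keep the highest-scoring KV pairs, preserving their original order."""
    n_keep = max(1, int(len(keys) * (1 - compression_ratio)))
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy usage: four 2-d keys, queries assumed to be centred on [1, 0].
keys = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0], [1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
mu = [1.0, 0.0]
sigma = [[1.0, 0.0], [0.0, 1.0]]

scores = expected_attention_scores(keys, mu, sigma)
kept_keys, kept_values, kept_idx = prune(keys, values, scores, compression_ratio=0.5)
print(kept_idx)
```

Because the score depends only on the key and on query statistics, it needs neither future tokens nor a materialized attention matrix, which is the limitation the abstract raises for attention-score-based pruning.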

Alessio Devoto, Maximilian Jeblick, Simon Jégou • 2025

Related benchmarks

Task	Dataset	Result
Long-context language modeling	LongBench	Average Score 46.47	369
Long-context Understanding	LongBench (test)	Avg Score 35.1	166
Long-context evaluation	LongBench	Average Score 43.88	96
Long-context Language Understanding	LongBench-e	Average Score 48.92	93
Long-context Understanding	RULER 4k (test)	RULER 4k Score 95.7	90
Long-context Understanding	RULER 16k (test)	RULER Score 93.4	90
Long-context retrieval and aggregation	RULER 32k	Average Accuracy 88.15	76
Long-context retrieval and aggregation	RULER 8k	Average Accuracy 79.89	76
Long-context retrieval and aggregation	RULER 4k	Average Accuracy 80.59	76
Long-context retrieval and aggregation	RULER 16k	Average Accuracy 76.6	76

Showing 10 of 21 rows
