Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

About

Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the memory footprint and bandwidth of the Key-Value (KV) cache. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism that predicts the future utility of each KV pair in order to reduce the size of the long-term KV cache. The strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written to the cache and used in global attention only if their predicted utility exceeds a given threshold. The LLM and the utility predictor are adapted from pretrained LLM checkpoints and trained jointly end-to-end, exclusively through the next-token prediction loss. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of $3\times$ to $10\times$, with longer sequences often being more compressible. This leads to large improvements in memory usage and decoding speed, with little to no degradation in validation loss or in performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that can guide the design of hybrid local-global attention architectures.
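The write rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy utility scores, the window size, and the threshold value are all hypothetical, and the learned utility predictor is replaced by a fixed list of numbers. It only shows which keys a query may attend to under the stated rule (local window always visible, older pairs visible only if their score exceeds the threshold) and the resulting cache reduction on a toy input.

```python
from typing import List


def visible_keys(scores: List[float], t: int, window: int, threshold: float) -> List[int]:
    """Key positions that the query at position t may attend to.

    A key j <= t is visible if it lies in the local window (t - j < window),
    or if its predicted utility exceeds the threshold, meaning it was
    written to the long-term cache.
    """
    return [
        j for j in range(t + 1)
        if t - j < window or scores[j] > threshold
    ]


def compression_ratio(scores: List[float], window: int, threshold: float) -> float:
    """Dense cache size divided by pruned cache size at the last position."""
    t = len(scores) - 1
    kept = len(visible_keys(scores, t, window, threshold))
    return (t + 1) / kept


# Hypothetical utility scores for 10 tokens (illustrative, not from the paper).
scores = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.1, 0.7, 0.2, 0.4]

print(visible_keys(scores, t=9, window=3, threshold=0.5))   # [0, 3, 7, 8, 9]
print(compression_ratio(scores, window=3, threshold=0.5))   # 2.0
```

In this toy case, positions 7 to 9 are kept by the local window and positions 0 and 3 are kept because their scores exceed the threshold, so 5 of 10 pairs remain. Because the number of retained pairs depends on the scores rather than on a fixed budget, the ratio varies with the input, which is the sense in which the abstract calls the sparsification dynamic.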

Gergely Szilvasy, Manuel Faysse, Maria Lomeli, Matthijs Douze, Pierre-Emmanuel Mazaré, Loïc Cabannes, Wen-tau Yih, Hervé Jégou ((1) Meta FAIR, (2) MICS, CentraleSupélec) • 2026

Related benchmarks

| Task | Dataset | Result (Score) | Rank |
|---|---|---|---|
| Long-context retrieval | RULER 8k context NIAH Single 1 | -- | 9 |
| Commonsense Question Answering | CSQA | 0.676 | 2 |
| Commonsense Reasoning | WinoGrande | 71.7 | 2 |
| Multi-task Language Understanding | MMLU | 56 | 2 |
| Question Answering | ARC Challenge | 50.9 | 2 |
| Question Answering | ARC Easy | 76 | 2 |
| Question Answering | Natural Questions (NQ) | 0.227 | 2 |
| Reading Comprehension | BoolQ | 73 | 2 |
| Code Generation | MBPP | 37.2 | 2 |
| Commonsense Reasoning | HellaSwag | 78.4 | 2 |
Showing 10 of 20 rows
