
A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

About

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm of a key embedding and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy. Moreover, without relying on the attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.
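The core idea in the abstract lends itself to a very small implementation. Below is a minimal sketch (not the authors' code) of the strategy as described: compute the $L_2$ norm of each cached key embedding and retain only the fraction of KV pairs with the lowest norms, since those tend to attract the highest attention scores. The function name, array shapes, and `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Retain the KV pairs whose key embeddings have the lowest L2 norm.

    keys, values: arrays of shape (seq_len, head_dim).
    keep_ratio: fraction of cached pairs to keep (0.5 = 50% compression).
    """
    norms = np.linalg.norm(keys, axis=-1)            # (seq_len,)
    n_keep = max(1, int(len(norms) * keep_ratio))
    # Low-norm keys tend to receive high attention, so keep those.
    keep_idx = np.argsort(norms)[:n_keep]
    keep_idx.sort()                                  # preserve token order
    return keys[keep_idx], values[keep_idx]

# Example: a toy cache of 8 tokens with 4-dimensional key embeddings.
rng = np.random.default_rng(0)
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
k_small, v_small = compress_kv_cache(k, v, keep_ratio=0.5)
print(k_small.shape)  # (4, 4)
```

Note that the eviction criterion depends only on the keys themselves, not on query-key attention scores, which is why the paper reports the method remains compatible with FlashAttention (where attention matrices are never materialised).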

Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini • 2024

Related benchmarks

Task                    Dataset     Result              Rank
Mathematical Reasoning  GSM8K       Accuracy: 18        983
Commonsense Reasoning   CSQA        Accuracy: 77        366
Mathematical Reasoning  GSM8K       Accuracy: 82        358
Question Answering      OBQA        Accuracy: 84        276
Mathematical Reasoning  MATH500     Accuracy: 33        133
Logical Reasoning       FOLIO       Accuracy: 39        119
Question Answering      StrategyQA  Accuracy: 88        114
Reading Comprehension   DROP        DROP Accuracy: 13   103
Commonsense Reasoning   OBQA        Accuracy: 38        75
Logical Reasoning       StrategyQA  Accuracy: 76        58

(Showing 10 of 15 rows)
