
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

About

We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity to attention scores. These results imply that KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families: with an 8K cache budget ($\sim$23% KV cache reduction), the performance gap from the non-evicting baseline on LongBench is less than 0.04% for Llama 3.1-8B and Llama 3.2-3B. We also observe near-baseline performance for DeepSeek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30% compared to other token-eviction methods.
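To make the idea of eviction "based solely on key similarity" concrete, here is a minimal sketch. The abstract does not spell out the scoring rule, so this is an assumption rather than the authors' implementation: each cached key is scored by its cosine similarity to the mean of the unit-normalized keys, and the least similar (most distinctive) keys are retained up to the cache budget. The names `select_keys_to_keep`, `_normalize`, and `budget` are hypothetical, and the keys are toy 2-D vectors standing in for one attention head's cache.

```python
import math


def _normalize(vec):
    """Scale vec to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def select_keys_to_keep(keys, budget):
    """Return sorted indices of the `budget` most distinctive keys.

    Assumption (not stated in the abstract): a key is "distinctive" when its
    cosine similarity to the mean of the unit-normalized keys is low.
    No attention scores are used anywhere in the selection.
    """
    if budget >= len(keys):
        return list(range(len(keys)))

    unit = [_normalize(k) for k in keys]
    dim = len(unit[0])

    # Anchor direction: normalized mean of the unit-length keys.
    anchor = _normalize(
        [sum(u[d] for u in unit) / len(unit) for d in range(dim)]
    )

    # Cosine similarity of each key to the anchor.
    scores = [sum(a * b for a, b in zip(u, anchor)) for u in unit]

    # Keep the keys with the lowest similarity; evict the rest.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i])
    return sorted(ranked[:budget])


# Toy cache: three keys point in nearly the same direction, one is an outlier.
keys = [
    [1.0, 0.0],
    [0.9, 0.1],
    [1.0, 0.05],
    [0.0, 1.0],
]
kept = select_keys_to_keep(keys, budget=2)
print(kept)
```

Because the selection depends only on the keys, a scheme of this kind never needs the attention matrix to be materialized, which is consistent with the abstract's point about compatibility with fused kernels such as FlashAttention.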

Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Long-context language modeling | LongBench | Average Score 43.3 | 328 |
| Mathematical Reasoning | AIME 25 | Pass@1 Accuracy 13.33 | 178 |
| Long-context Question Answering | Locomo | F1 (Multi Hop) 28.3 | 171 |
| Long-context Understanding | LongBench v2 | -- | 133 |
| Long-context Language Understanding | LongBench-e | Average Score 44.97 | 93 |
| Long-context Understanding | RULER 4k (test) | RULER 4k Score 95.3 | 90 |
| Long-context Understanding | RULER 16k (test) | RULER Score 92.9 | 90 |
| Long-term Conversation Question Answering | REALTALK | Multi-hop Score 33 | 84 |
| Long-context Question Answering | LongMemEval LongConvQA | SH Score 70.7 | 84 |
Showing 10 of 34 rows
