
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

About

Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory size and bandwidth. The Key-Value (KV) cache size grows linearly with sequence length and needs to be re-read from off-chip high-bandwidth memory (HBM) to on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention computation. Based on these weights, it assigns each token or channel a bit-width ranging from full precision down to zero bits guided by reverse water-filling, applied once after the prefilling stage. Experiments on LongBench, RULER, and InfiniteBench show that RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench it recovers 97.81% of full-cache accuracy with only 2.48% cache retention. Compared with full-cache FlashAttention-2 decoding, it achieves 4.5x decode speedup and 1.9x peak memory reduction with 128K context length, while maintaining comparable performance.
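The abstract's bit-allocation step can be illustrated with textbook reverse water-filling for independent Gaussian components. The sketch below is not RDKV's actual procedure. The function name `reverse_water_fill`, the use of per-component "weighted variances" as inputs, and the bisection search for the water level are all assumptions made for illustration. Each component with weighted variance v receives max(0, 0.5 * log2(v / theta)) bits, and components that fall below the water level theta receive zero bits, which corresponds to eviction.

```python
import math


def reverse_water_fill(variances, total_bits, iters=200):
    """Textbook reverse water-filling (illustrative, not the paper's code).

    variances  : positive per-component (weighted) variances
    total_bits : total bit budget shared by all components
    Returns real-valued per-component rates whose sum is at most
    total_bits, up to bisection precision.
    """
    if total_bits <= 0:
        return [0.0] * len(variances)

    log_vars = [math.log2(v) for v in variances]

    def rates(log_theta):
        # Components below the water level get zero bits (eviction).
        return [max(0.0, 0.5 * (lv - log_theta)) for lv in log_vars]

    # Total rate decreases as the water level rises, so bisect on
    # log2(theta). At log_hi the total rate is 0; at log_lo the largest
    # component alone already exceeds the budget.
    log_hi = max(log_vars)
    log_lo = log_hi - 2.0 * total_bits - 2.0
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if sum(rates(mid)) > total_bits:
            log_lo = mid
        else:
            log_hi = mid
    return rates(log_hi)


# Hypothetical weighted variances for four tokens or channels.
example_rates = reverse_water_fill([16.0, 4.0, 1.0, 0.01], total_bits=3)
```

In this toy example the water level settles at theta = 1, so the two highest-variance components receive about 2 bits and 1 bit, and the other two receive none. A practical scheme would additionally round the rates to supported bit-widths, a step this sketch leaves out.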

Junkai Zhang, Hang Guo, Luca Benini, Yawei Li • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Long-context language modeling | LongBench | Average Score | 53.82 | 328 |
| Long-context Understanding | LongBench 1.0 (test) | NarrativeQA | 30.41 | 84 |
| Long-context Language Understanding | LongBench v1 (test) | NrtvQA Score | 27 | 48 |
| Long-context Language Understanding | LongBench | NrtvQA Score | 27.75 | 26 |
| Long-context Reasoning | InfiniteBench (test) | Average Score | 39.46 | 12 |
| Long-context language modeling | RULER, sequence length = 64k | S-NIAH Score (Component 1) | 100 | 8 |
| Retrieval and reasoning | RULER | Retrieval/Reasoning Score (4K Context) | 98.62 | 6 |
| Long-context language modeling | RULER, sequence length = 32k | S-NIAH Component 1 Score | 100 | 2 |
| Long-context language modeling | RULER, sequence length = 16k | S-NIAH Component 1 | 100 | 2 |