RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
About
Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory size and bandwidth. The Key-Value (KV) cache size grows linearly with sequence length and needs to be re-read from off-chip high-bandwidth memory (HBM) to on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention computation. Based on these weights, it assigns each token or channel a bit-width ranging from full precision down to zero bits guided by reverse water-filling, applied once after the prefilling stage. Experiments on LongBench, RULER, and InfiniteBench show that RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench it recovers 97.81% of full-cache accuracy with only 2.48% cache retention. Compared with full-cache FlashAttention-2 decoding, it achieves 4.5x decode speedup and 1.9x peak memory reduction with 128K context length, while maintaining comparable performance.
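To make the "eviction and quantization as two end-points of one bit allocation" idea concrete, the following is a minimal sketch of classical reverse water-filling for independent Gaussian components, not the RDKV implementation itself. It assumes each token or channel carries a positive scalar weight that plays the role of a variance; the function name `reverse_water_fill`, its parameters, and the example weights are hypothetical. Components whose weight falls below the water level `theta` receive zero bits (eviction), while the rest receive `0.5 * log2(w / theta)` bits (quantization). How RDKV computes its weights and maps real-valued rates to concrete bit-widths is not specified in the abstract and is not modeled here.

```python
import math


def reverse_water_fill(weights, total_bits, iters=100):
    """Classical reverse water-filling (illustrative sketch, not RDKV's code).

    weights:    positive per-component weights, treated as variances (assumption).
    total_bits: total rate budget in bits, shared by all components.
    Returns (theta, rates) with rates[i] = max(0, 0.5 * log2(weights[i] / theta)).
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    if total_bits < 0:
        raise ValueError("total_bits must be non-negative")

    log_w = [math.log2(w) for w in weights]

    def rates_at(t):
        # t is log2 of the water level; components below the level get 0 bits.
        return [max(0.0, 0.5 * (lw - t)) for lw in log_w]

    # The total rate decreases monotonically in t, so bisection applies.
    lo = min(log_w) - 2.0 * total_bits  # total rate here is >= total_bits
    hi = max(log_w)                     # total rate here is 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(rates_at(mid)) > total_bits:
            lo = mid  # over budget: raise the water level
        else:
            hi = mid  # within budget: try a lower water level
    return 2.0 ** hi, rates_at(hi)


# Hypothetical weights for four tokens and a 3-bit total budget.
theta, rates = reverse_water_fill([16.0, 4.0, 1.0, 0.25], total_bits=3.0)
print(theta, rates)  # theta is about 1.0; rates are about [2.0, 1.0, 0.0, 0.0]
```

In this toy run the two lowest-weight components end up with zero bits, which corresponds to eviction, and the remaining budget is split between the two highest-weight components in proportion to the logarithm of their weights.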
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Long-context language modeling | LongBench | Average Score | 53.82 | 328 |
| Long-context Understanding | LongBench 1.0 (test) | NarrativeQA | 30.41 | 84 |
| Long-context Language Understanding | LongBench v1 (test) | NrtvQA Score | 27 | 48 |
| Long-context Language Understanding | LongBench | NrtvQA Score | 27.75 | 26 |
| Long-context Reasoning | InfiniteBench (test) | Average Score | 39.46 | 12 |
| Long-context language modeling | RULER (sequence length = 64k) | S-NIAH Score (Component 1) | 100 | 8 |
| Retrieval and reasoning | RULER | Retrieval/Reasoning Score (4K Context) | 98.62 | 6 |
| Long-context language modeling | RULER (sequence length = 32k) | S-NIAH Component 1 Score | 100 | 2 |
| Long-context language modeling | RULER (sequence length = 16k) | S-NIAH Component 1 | 100 | 2 |