RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

About

Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory capacity and bandwidth. The key-value (KV) cache grows linearly with sequence length and must be re-read from off-chip high-bandwidth memory (HBM) into on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two endpoints of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention computation. Based on these weights, it assigns each token or channel a bit-width ranging from full precision down to zero bits, guided by reverse water-filling and applied once after the prefilling stage. Experiments on LongBench, RULER, and InfiniteBench show that RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench it recovers 97.81% of full-cache accuracy with only 2.48% cache retention. Compared with full-cache FlashAttention-2 decoding, it achieves a 4.5x decode speedup and a 1.9x peak memory reduction at a 128K context length, while maintaining comparable performance.
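To make the bit allocation idea concrete, here is a minimal sketch of textbook reverse water-filling under a total bit budget. It is not the paper's implementation: the function name `reverse_water_filling`, the `max_bits` cap standing in for "full precision", the use of per-component weighted variances as input, and the bisection search are all assumptions made for illustration.

```python
import numpy as np

def reverse_water_filling(variances, total_bits, max_bits=16.0, iters=60):
    """Textbook reverse water-filling under a total rate budget.

    variances  : per-component (weighted) variances, all > 0 (assumed input)
    total_bits : total bit budget summed over all components
    max_bits   : per-component cap (hypothetical stand-in for full precision)
    Returns (bits, theta): real-valued bits per component and the water level.
    """
    v = np.asarray(variances, dtype=np.float64)

    def rates(theta):
        # Components whose variance lies below the water level get zero bits.
        r = 0.5 * np.log2(np.maximum(v / theta, 1.0))
        return np.minimum(r, max_bits)

    # The total rate falls monotonically as theta rises, so bisect in log space.
    lo = v.min() * 2.0 ** (-2.0 * max_bits)  # every component hits max_bits
    hi = v.max()                             # every component gets zero bits
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if rates(mid).sum() > total_bits:
            lo = mid  # over budget: raise the water level
        else:
            hi = mid  # within budget: try a lower water level
    return rates(hi), hi

bits, theta = reverse_water_filling([16.0, 4.0, 1.0, 0.01], total_bits=3.0)
```

In this sketch, a component that receives zero bits corresponds to eviction and one that receives a positive number of bits corresponds to quantization. The returned bit counts are real-valued; mapping them to the bit-widths a kernel actually supports is a separate step that is not shown here.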

Junkai Zhang, Hang Guo, Luca Benini, Yawei Li • 2026

Related benchmarks

Task	Dataset	Metric	Result
Long-context language modeling	LongBench	Average Score	53.82	369
Long-context Understanding	LongBench 1.0 (test)	NarrativeQA	30.41	108
Long-context Language Understanding	LongBench v1 (test)	NrtvQA Score	27	48
Long-context Language Understanding	LongBench	NrtvQA Score	27.75	37
Long-context Reasoning	InfiniteBench (test)	Average Score	39.46	12
Long-context language modeling	RULER Sequence length = 64k	S-NIAH Score (Component 1)	100	8
Retrieval and reasoning	RULER	Retrieval/Reasoning Score (4K Context)	98.62	6
Long-context language modeling	RULER Sequence length = 32k	S-NIAH Component 1 Score	100	2
Long-context language modeling	RULER Sequence length = 16k	S-NIAH Component 1	100	2

