One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

About

Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.

Liming Lu, Kaixi Qiu, Jiayu Zhou, Jushi Kai, Haoyan Zhang, Huanyu Wang, Jingwen Leng, Ziwei He, Zhouhan Lin • 2026
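The abstract does not spell out how DynaKV's token-wise rank allocation works, so the sketch below is only a minimal illustration of the general idea, not the paper's actual algorithm. It assumes a shared low-rank basis obtained offline by SVD of a calibration cache, a hypothetical per-token importance score (e.g., something derived from attention weights), and hypothetical helper names (`calibrate_basis`, `allocate_ranks`, `compress_tokens`, `decompress_tokens`).

```python
import torch

def calibrate_basis(kv_calib: torch.Tensor) -> torch.Tensor:
    """SVD of a calibration cache (n_tokens, head_dim) -> orthonormal
    principal directions (head_dim, head_dim), sorted by singular value."""
    _, _, vh = torch.linalg.svd(kv_calib, full_matrices=False)
    return vh

def allocate_ranks(importance: torch.Tensor, r_min: int, r_max: int) -> torch.Tensor:
    """Map per-token importance scores to integer ranks in [r_min, r_max].
    Tokens deemed more important keep more dimensions (higher fidelity)."""
    s = (importance - importance.min()) / (importance.max() - importance.min() + 1e-8)
    return (r_min + s * (r_max - r_min)).round().long()

def compress_tokens(kv: torch.Tensor, basis: torch.Tensor, ranks: torch.Tensor):
    """Project each token onto the shared basis and keep only its top-r_i
    coefficients. Token i stores ranks[i] floats instead of head_dim."""
    coeffs = kv @ basis.T  # (n_tokens, head_dim) coordinates in the basis
    return [coeffs[i, : int(ranks[i])].clone() for i in range(kv.shape[0])]

def decompress_tokens(stored, basis: torch.Tensor, head_dim: int) -> torch.Tensor:
    """Reconstruct an approximate dense cache from the ragged coefficients."""
    out = torch.zeros(len(stored), head_dim, dtype=basis.dtype)
    for i, c in enumerate(stored):
        out[i] = c @ basis[: c.shape[0]]  # low-rank reconstruction of token i
    return out

# Toy usage: 512 calibration tokens and 8 live tokens in a 64-dim head.
calib = torch.randn(512, 64)
basis = calibrate_basis(calib)
keys = torch.randn(8, 64)
importance = torch.rand(8)  # stand-in for a real per-token saliency signal
ranks = allocate_ranks(importance, r_min=8, r_max=48)
stored = compress_tokens(keys, basis, ranks)
approx = decompress_tokens(stored, basis, head_dim=64)
```

In this toy setup the cache shrinks from `n_tokens * head_dim` floats to `sum(ranks)` floats, and the average ratio depends on how importance is distributed. The abstract's SnapKV combination would additionally drop whole tokens before this per-dimension compression is applied.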

Related benchmarks

Task | Dataset | Result | Rank
Language Modeling | C4 | Perplexity: 10.73 | 1071
Language Modeling | PG-19 | Perplexity: 10.46 | 160
Long-context Understanding | LongBench | HotpotQA: 11.49 | 82
Question Answering and Commonsense Reasoning | Short-context benchmarks (ARC-C, ARC-E, PIQA, Winogrande, HellaSwag) | ARC-C Accuracy: 44.75 | 17
