
Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression

About

Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero dimensions or the full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a finer granularity, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves performance comparable to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
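The core idea above — treating token eviction as an all-or-nothing dimension assignment and relaxing it to a per-token dimension budget — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the allocation rule (importance-proportional), the importance scores, and the per-token truncation (keeping top-magnitude dimensions) are all assumptions for demonstration.

```python
import numpy as np

def allocate_dims(importance, head_dim, budget_ratio):
    """Split a global dimension budget across tokens in proportion to
    importance (hypothetical rule; the paper's scheme is not given in
    the abstract). Eviction is the special case of 0-or-head_dim."""
    n = len(importance)
    total_budget = int(budget_ratio * n * head_dim)
    weights = importance / importance.sum()
    dims = np.floor(weights * total_budget).astype(int)
    return np.clip(dims, 0, head_dim)

def compress_kv(keys, dims):
    """Keep, per token, only its dims[i] largest-magnitude key
    dimensions (one plausible truncation; also an assumption).
    Returns (kept-dimension indices, kept values) per token."""
    compressed = []
    for k, d in zip(keys, dims):
        idx = np.sort(np.argsort(-np.abs(k))[:d])
        compressed.append((idx, k[idx]))
    return compressed

# Toy example: 4 cached tokens, head_dim = 8, keep 50% of dims overall.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 8))
importance = np.array([4.0, 2.0, 1.0, 1.0])  # e.g. attention-mass proxy
dims = allocate_dims(importance, head_dim=8, budget_ratio=0.5)
print(dims)  # → [8 4 2 2]: important tokens keep more dimensions
```

Under this rule the most important token keeps its full dimension while less important tokens are shrunk rather than evicted, which is the granularity gain the abstract describes; MixedDimKV-H would additionally modulate the budget per attention head.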

Ruijie Miao, Zhiming Wang, Wang Li, Shiwei Wu, Shufan Liu, Yanbing Jiang, Tong Yang• 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Long-context language understanding | LongBench | -- | -- | 292 |
| Long-context language modeling evaluation | RULER (context length = 8K) | Average Accuracy (RULER 8K) | 89.47 | 72 |
| Long-context language modeling | LongBench (test) | Qasper Score | 41.5 | 29 |
| Long-context language modeling evaluation | RULER 32K | S1 Score (RULER 32K) | 100 | 7 |
