RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations

About

Key-Value (KV) cache facilitates efficient large language models (LLMs) inference by avoiding recomputation of past KVs. As the batch size and context length increase, the oversized KV caches become a significant memory bottleneck, highlighting the need for efficient compression. Existing KV quantization rely on fine-grained quantization or the retention of a significant portion of high bit-widths caches, both of which compromise compression ratio and often fail to maintain robustness at extremely low average bit-widths. In this work, we explore the potential of rotation technique for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations: (i) Outlier-Aware Rotation, which utilizes channel-reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing the computational efficiency of the fast Walsh-Hadamard transform (FWHT); (ii) Pre-RoPE Grouped-Head Rotation, which mitigates the impact of rotary position embedding (RoPE) on proposed outlier-aware rotation and further smooths outliers across heads; (iii) Attention-Sink-Aware Quantization, which leverages the massive activations to precisely identify and protect attention sinks. RotateKV achieves less than 0.3 perplexity (PPL) degradation with 2-bit quantization on WikiText-2 using LLaMA-2-13B, maintains strong CoT reasoning and long-context capabilities, with less than 1.7\% degradation on GSM8K, outperforming existing methods even at lower average bit-widths. RotateKV also showcases a 3.97x reduction in peak memory usage, supports 5.75x larger batch sizes, and achieves a 2.32x speedup in decoding stage.

Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan• 2025

Related benchmarks

Task	Dataset	Result
Long-context Understanding	LongBench (test)	Avg Score52.89	136
Mathematical Reasoning	MATH 500	--	106
Long-context Language Understanding	LongBench-e	Average Score42.95	93
Mathematical Reasoning	MATH 500	Accuracy (pass@1)93.4	88
Scientific Reasoning	GPQA Diamond	--	62
Visual Question Answering	DocVQA	ANLS92.76	59
Code Generation	LiveCodeBench	Pass@118.33	51
Question Answering	GPQA Diamond	Pass@142.93	49
Mathematical Reasoning	AIME 2024, 2025	Accuracy (pass@1)61.67	24
Code Generation	LiveCodeBench Jan-Apr 2025	Accuracy (pass@1)42.86	24

Showing 10 of 12 rows

Other info

Follow for update

@wizwand_team Discord