
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

About

KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large-batch scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: they overlook layer-wise sensitivity to KV cache quantization, incur high overhead from online fine-grained decision-making, and offer low flexibility across different LLMs and constraints. We therefore theoretically analyze the inherent correlation between layer-wise transformer attention patterns and KV cache quantization errors, and study why the key cache is generally more important than the value cache for reducing quantization error. We further propose KVTuner, a simple yet effective framework that adaptively searches for optimal, hardware-friendly, layer-wise KV quantization precision pairs for coarse-grained KV cache via multi-objective optimization, and directly uses the offline-searched configurations during online inference. To reduce the computational cost of offline calibration, we apply intra-layer KV precision-pair pruning and inter-layer clustering to shrink the search space. Experimental results show nearly lossless 3.25-bit mixed-precision KV cache quantization for LLMs such as Llama-3.1-8B-Instruct, and 4.0-bit for sensitive models such as Qwen2.5-7B-Instruct, on mathematical reasoning tasks. Maximum inference throughput improves by up to 21.25% over KIVI-KV8 quantization across various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
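To make the core idea concrete, here is a minimal sketch of layer-wise mixed-precision KV cache quantization using per-token asymmetric uniform quantization. This is an illustrative toy, not KVTuner's actual kernels: the `layer_precisions` table stands in for a hypothetical offline-searched (key bits, value bits) configuration, and the quantizer is the simplest possible round-to-nearest scheme.

```python
import numpy as np

def quantize_kv(x, bits):
    """Per-token asymmetric uniform quantization to `bits` bits.

    Illustrative only; real KV cache quantizers (e.g. KIVI-style)
    use channel/group-wise schemes and fused kernels.
    """
    qmax = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # Avoid division by zero for constant tokens.
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q.astype(np.uint8), scale, lo

def dequantize_kv(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Hypothetical offline-searched config: each layer gets its own
# (key_bits, value_bits) pair; keys get more bits than values,
# matching the observation that the key cache is more sensitive.
layer_precisions = {0: (4, 2), 1: (8, 4)}

rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 64)).astype(np.float32)  # 16 tokens, head dim 64
key_bits, value_bits = layer_precisions[0]
qk, scale, zero = quantize_kv(keys, key_bits)
max_err = np.abs(dequantize_kv(qk, scale, zero) - keys).max()
```

At 4 bits the worst-case rounding error per element is half a quantization step, so `max_err` stays small relative to the per-token value range; at inference time only the uint8 codes plus per-token `scale`/`zero` need to be cached.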

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | -- | -- | 106 |
| Question Answering | GPQA Diamond | Pass@1 | 39.9 | 49 |
| Code Generation | LiveCodeBench | Pass@1 | 19.23 | 37 |
| Scientific Reasoning | GPQA Diamond | Accuracy (pass@1) | 59.6 | 24 |
| Mathematical Reasoning | MATH 500 | Accuracy (pass@1) | 91.4 | 24 |
| Code Generation | LiveCodeBench Jan-Apr 2025 | Accuracy (pass@1) | 41.76 | 24 |
| Mathematical Reasoning | AIME 2024, 2025 | Accuracy (pass@1) | 51.67 | 24 |
| Mathematical Reasoning | AIME 2024, 2025 | Accuracy (pass@1) | 31.67 | 8 |
