
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

About

KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large-batch scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: they overlook layer-wise sensitivity to KV cache quantization, incur high overhead from online fine-grained decision-making, and offer low flexibility across different LLMs and constraints. We therefore theoretically analyze the inherent correlation between layer-wise transformer attention patterns and KV cache quantization errors, and study why the key cache is generally more important than the value cache for reducing quantization error. We further propose KVTuner, a simple yet effective framework that adaptively searches for optimal, hardware-friendly, layer-wise KV quantization precision pairs for coarse-grained KV cache via multi-objective optimization, and directly uses the offline-searched configurations during online inference. To reduce the computational cost of offline calibration, we apply intra-layer KV precision-pair pruning and inter-layer clustering to shrink the search space. Experimental results show nearly lossless 3.25-bit mixed-precision KV cache quantization for LLMs such as Llama-3.1-8B-Instruct, and 4.0-bit for sensitive models such as Qwen2.5-7B-Instruct, on mathematical reasoning tasks. Maximum inference throughput improves by up to 21.25% over KIVI-KV8 quantization across various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
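To make the core idea concrete, here is a minimal sketch of layer-wise mixed-precision KV cache quantization using per-token asymmetric uniform quantization. This is an illustrative toy, not KVTuner's actual kernels: the `layer_precisions` table stands in for a hypothetical offline-searched (key bits, value bits) configuration, and the quantizer is the simplest possible round-to-nearest scheme.

```python
import numpy as np

def quantize_kv(x, bits):
    """Per-token asymmetric uniform quantization to `bits` bits.

    Illustrative only; real KV cache quantizers (e.g. KIVI-style)
    use channel/group-wise schemes and fused kernels.
    """
    qmax = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # Avoid division by zero for constant tokens.
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q.astype(np.uint8), scale, lo

def dequantize_kv(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Hypothetical offline-searched config: each layer gets its own
# (key_bits, value_bits) pair; keys get more bits than values,
# matching the observation that the key cache is more sensitive.
layer_precisions = {0: (4, 2), 1: (8, 4)}

rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 64)).astype(np.float32)  # 16 tokens, head dim 64
key_bits, value_bits = layer_precisions[0]
qk, scale, zero = quantize_kv(keys, key_bits)
max_err = np.abs(dequantize_kv(qk, scale, zero) - keys).max()
```

At 4 bits the worst-case rounding error per element is half a quantization step, so `max_err` stays small relative to the per-token value range; at inference time only the uint8 codes plus per-token `scale`/`zero` need to be cached.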

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | -- | -- | 106 |
| Question Answering | GPQA Diamond | Pass@1 | 39.9 | 49 |
| Code Generation | LiveCodeBench | Pass@1 | 19.23 | 37 |
| Scientific Reasoning | GPQA Diamond | Accuracy (pass@1) | 59.6 | 24 |
| Mathematical Reasoning | MATH 500 | Accuracy (pass@1) | 91.4 | 24 |
| Code Generation | LiveCodeBench Jan-Apr 2025 | Accuracy (pass@1) | 41.76 | 24 |
| Mathematical Reasoning | AIME 2024, 2025 | Accuracy (pass@1) | 51.67 | 24 |
| Mathematical Reasoning | AIME 2024, 2025 | Accuracy (pass@1) | 31.67 | 8 |
