KV Cache Transform Coding for Compact Storage in LLM Inference

About

Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20× compression while maintaining reasoning and long-context accuracy, and 40× or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, GSM8K, LiveCodeBench, LongBench, MATH-500, MMLU, Qasper and RULER. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.

Konrad Staniszewski, Adrian Łańcucki • 2025
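
The three stages named in the abstract (PCA-based decorrelation, quantization, entropy coding) can be illustrated with a toy transform coder. This is a minimal sketch under stated assumptions, not the KVTC implementation: the data is synthetic rather than a real KV cache, a single fixed quantization step stands in for the adaptive quantization, and zlib's DEFLATE stands in for the entropy coder. The names `fit_pca`, `compress`, `decompress`, and `sample` are hypothetical.

```python
import zlib

import numpy as np


def fit_pca(calib):
    """Fit a PCA basis on calibration vectors (rows = tokens, cols = features)."""
    mean = calib.mean(axis=0)
    # Rows of vt are the principal directions of the centered calibration data.
    _, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vt


def compress(kv, mean, vt, step):
    """Decorrelate, quantize, and losslessly pack a block of vectors."""
    coeffs = (kv - mean) @ vt.T  # project onto the PCA basis
    # Assumption: one uniform step for every component. Low-variance
    # components round to zero, so they cost almost nothing after packing.
    q = np.clip(np.round(coeffs / step), -32767, 32767).astype(np.int16)
    # Store component-major so each component's values are contiguous.
    payload = zlib.compress(q.T.tobytes(), level=9)
    return payload, q.shape


def decompress(payload, shape, mean, vt, step):
    """Invert compress(): unpack, dequantize, and rotate back."""
    n, d = shape
    q = np.frombuffer(zlib.decompress(payload), dtype=np.int16).reshape(d, n).T
    return (q.astype(np.float32) * step) @ vt[:d] + mean


rng = np.random.default_rng(0)
# Synthetic stand-in for KV vectors: 64 features driven by 8 latent
# factors plus small noise, so the features are strongly correlated.
mixing = rng.normal(size=(8, 64))


def sample(n):
    latent = rng.normal(size=(n, 8))
    noise = 0.01 * rng.normal(size=(n, 64))
    return (latent @ mixing + noise).astype(np.float32)


calib = sample(2048)  # one-time calibration set
kv = sample(512)      # block to compress

mean, vt = fit_pca(calib)
payload, shape = compress(kv, mean, vt, step=0.1)
restored = decompress(payload, shape, mean, vt, step=0.1)

ratio = kv.nbytes / len(payload)
rel_err = np.linalg.norm(restored - kv) / np.linalg.norm(kv)
print(f"compression ratio: {ratio:.1f}x, relative error: {rel_err:.4f}")
```

The basis is fitted once on calibration data and then reused for every block, which mirrors the abstract's point that only a brief initial calibration is needed and model parameters stay unchanged. The ratio printed here reflects the synthetic data and says nothing about the figures reported for KVTC.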

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 62.5	1398
Multi-task Language Understanding	MMLU	Accuracy 64.6	353
Mathematical Reasoning	MATH 500	pass@1 74.41	239
Mathematics	AIME 2024	Accuracy 52.5	60
Document Question Answering	Qasper	Accuracy 40.7	44
Coding	LiveCodeBench	Accuracy 36.5	38
Key-Value Retrieval	LITM (Lost in the Middle)	Accuracy 99.9	33
Variable Tracking	RULER-VT	Accuracy 99.5	33
Long-context Language Understanding	LongBench 1 host v1 (test)	2WQA Score 46.23	14
Long-context Language Understanding	RULER 0 shot v1 (test)	CWE Score 92.41	7

Showing 10 of 13 rows
