KV Cache Transform Coding for Compact Storage in LLM Inference

About

Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20× compression while maintaining reasoning and long-context accuracy, and 40× or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, GSM8K, LiveCodeBench, LongBench, MATH-500, MMLU, Qasper, and RULER. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
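The three stages named in the abstract (decorrelation, quantization, entropy coding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `calibrate`, `compress`, and `decompress`, the single uniform quantization step (in place of adaptive quantization), the use of DEFLATE via `zlib` as a stand-in entropy coder, and the synthetic low-rank data are all assumptions made for illustration.

```python
import zlib
import numpy as np

def calibrate(samples):
    """Fit a PCA basis on calibration vectors (rows = tokens, cols = features)."""
    mean = samples.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt.T  # columns form an orthonormal basis

def compress(kv, mean, basis, step=0.05):
    """Decorrelate, quantize uniformly, then entropy-code (DEFLATE as a stand-in)."""
    coeffs = (kv - mean) @ basis
    # Assumption: one fixed step for every component, not an adaptive scheme.
    q = np.clip(np.round(coeffs / step), -32767, 32767).astype(np.int16)
    return zlib.compress(q.tobytes(), 9), q.shape

def decompress(blob, shape, mean, basis, step=0.05):
    """Invert the pipeline: decode, dequantize, rotate back to feature space."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(shape)
    return (q.astype(np.float32) * step) @ basis.T + mean

# Synthetic stand-in for KV vectors: 64 features driven by 4 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(512, 4))
mixing = rng.normal(size=(4, 64))
data = (latent @ mixing + 0.01 * rng.normal(size=(512, 64))).astype(np.float32)

mean, basis = calibrate(data[:256])        # brief calibration on held-out rows
blob, shape = compress(data[256:], mean, basis)
restored = decompress(blob, shape, mean, basis)

print("compression ratio:", data[256:].nbytes / len(blob))
print("max abs error:", float(np.abs(restored - data[256:]).max()))
```

Because the synthetic data is highly redundant, most decorrelated coefficients quantize to zero and compress well; the ratio printed here says nothing about real KV caches or about the figures reported for KVTC.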

Konrad Staniszewski, Adrian Łańcucki • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 62.5 | 1362 |
| Multi-task Language Understanding | MMLU | Accuracy | 64.6 | 321 |
| Mathematical Reasoning | MATH 500 | pass@1 | 74.41 | 239 |
| Mathematics | AIME 2024 | Accuracy | 52.5 | 60 |
| Document Question Answering | Qasper | Accuracy | 40.7 | 44 |
| Key-Value Retrieval | LITM (Lost in the Middle) | Accuracy | 99.9 | 33 |
| Variable Tracking | RULER-VT | Accuracy | 99.5 | 33 |
| Long-context Language Understanding | LongBench 1 host v1 (test) | 2WQA Score | 46.23 | 14 |
| Coding | LiveCodeBench | Accuracy | 36.5 | 8 |
| Long-context Language Understanding | RULER 0 shot v1 (test) | CWE Score | 92.41 | 7 |
Showing 10 of 13 rows
