
TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

About

The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight: loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations demonstrate that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. Notably, Llama-3.1-8B with a 128k context can be served on a single RTX 3090 GPU, reaching 82 ms per token during decoding.
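The paper's central insight — that aggressively quantizing all cached tokens and keeping only a few dominant tokens in full precision can complement each other — can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation; the function names, the 2-bit setting, and the use of precomputed attention scores to pick dominant tokens are assumptions for the sketch.

```python
import numpy as np

def quantize_dequant(x, bits=2):
    """Simulate aggressive per-channel uniform quantization of a KV tensor."""
    lo = x.min(axis=0)
    scale = (x.max(axis=0) - lo) / (2 ** bits - 1) + 1e-8
    q = np.round((x - lo) / scale)          # integer codes kept on GPU
    return q * scale + lo                   # dequantized approximation

def hybrid_kv(k_cache, attn_scores, top_r=4, bits=2):
    """Hybrid sketch: quantize every token, then restore the top-r
    dominant tokens (largest attention scores) in full precision,
    as if selectively loaded from CPU-offloaded storage."""
    dominant = np.argsort(attn_scores)[-top_r:]
    approx = quantize_dequant(k_cache, bits=bits)
    approx[dominant] = k_cache[dominant]    # dominant tokens stay lossless
    return approx
```

The dominant tokens, which would otherwise contribute the largest quantization error, are exact, while the long tail of remaining tokens pays only a small, bounded quantization error each.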

Dingyu Yao, Bowen Shen, Zheng Lin, Wei Liu, Jian Luan, Bin Wang, Weiping Wang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Long-context Understanding | LongBench (test) | Avg Score | 52.9 | 80
Long-context Language Understanding | InfiniteBench | En.Sum | 24.1 | 63
Long-context Understanding | InfiniteBench v1 (test) | Dialogue | 18.5 | 31
Long-context evaluation | RULER 128k | Query Metric (MQ) | 98 | 29
Long-context evaluation | RULER 64k | VT Score | 88 | 29
Long-context Understanding | LongBench v1 (test) | SD QA | 49.3 | 21
Decoding Latency | Synthetic Context Sequences (test) | Latency (16k Context) | 0.041 | 16
Decoding Latency | Llama-3.1-8B 16k sequence length v1 (inference) | Decoding Latency (s) | 0.045 | 8
Decoding Latency | Llama-3.1-8B 32k sequence length v1 (inference) | Decoding Latency (s) | 0.047 | 7
Decoding Latency | Llama-2-7B 16k sequence length v1 (inference) | Decoding Latency (s) | 0.041 | 6

Showing 10 of 14 rows.

Other info

Code
