DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction

About

Large language models (LLMs) demonstrate remarkable capabilities but face substantial serving costs due to their high memory demands, with the key-value (KV) cache being a primary bottleneck. State-of-the-art KV cache compression techniques, such as quantization and pruning, treat keys and values uniformly and discard unimportant tokens entirely, overlooking fine-grained differences in the significance of individual KV cache components. To address these limitations, we introduce DiffKV, a novel framework for efficient KV cache compression that exploits three levels of differentiation in the KV cache: (1) the differing impact of keys and values on attention computation, (2) the varying importance of tokens, and (3) the diverse dynamic sparsity patterns across attention heads. These levels of differentiation introduce irregular memory usage patterns across requests and attention heads, posing significant scalability challenges for memory management. To address these challenges, DiffKV proposes an on-GPU memory manager that compacts fragmented free memory lists into contiguous regions in parallel, effectively translating sparsity in the KV cache into performance gains. We evaluate DiffKV on several mainstream LLMs, including the emerging thinking models that generate extended chains of thought. DiffKV compresses the KV cache by 2.7× to 5.7× with near-lossless accuracy on complex workloads requiring sophisticated reasoning and long-generation capabilities, and improves throughput by 1.9× to 5.4×. The source code of DiffKV is available at https://github.com/zyqCSL/DiffKV.
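To illustrate what compacting a fragmented free list "in parallel" can look like, here is a minimal sketch of the generic prefix-sum (stream compaction) technique. It is not taken from the DiffKV code base: the function name `compact_free_pages` and the per-page boolean layout are assumptions made for illustration, and the loop stands in for what would be one GPU thread per page.

```python
from itertools import accumulate


def compact_free_pages(is_free):
    """Pack the ids of free pages into a contiguous list.

    Hypothetical helper, not DiffKV's API. is_free[i] is True when page i
    is free. Each free page's output slot is the exclusive prefix sum of the
    flags, so every page can find its slot without coordinating with others.
    """
    flags = [1 if f else 0 for f in is_free]
    # Exclusive prefix sum: slots[i] = number of free pages before page i.
    slots = [0] + list(accumulate(flags))[:-1] if flags else []
    out = [None] * sum(flags)
    # No iteration depends on another; on a GPU this would be a parallel scatter.
    for page, (flag, slot) in enumerate(zip(flags, slots)):
        if flag:
            out[slot] = page
    return out


print(compact_free_pages([True, False, True, True, False]))  # [0, 2, 3]
```

The prefix sum itself is also parallelizable (a scan), which is why this pattern is a common choice for compaction on GPUs; whether DiffKV uses exactly this scheme is not stated in the abstract.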

Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen • 2024

Related benchmarks

Task	Dataset	Metric	Result
Long-context Text Summarization	MultiNews 128K context	ROUGE-L	28.7	18
Long-context Question Answering	Qasper 128K context	F1 Score	38	18
Long-context retrieval and reasoning	RULER 128K context	Accuracy	62.5	18
Repository-level code completion	RepoBench-P 128K context	Score	58.5	18
Long-context Understanding	L-Eval 32K	P95 Latency (ms)	338	12
Summarization	GovReport 16K	P95 Latency (ms)	185	12

