DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction

About

Large language models (LLMs) demonstrate remarkable capabilities but face substantial serving costs due to their high memory demands, with the key-value (KV) cache being a primary bottleneck. State-of-the-art KV cache compression techniques, such as quantization and pruning, treat keys and values uniformly and discard unimportant tokens entirely, overlooking the fine-grained differences in the significance of individual KV cache components. To address these limitations, we introduce DiffKV, a novel framework for efficient KV cache compression that exploits three levels of differentiation in the KV cache: (1) the differing impact of keys and values on attention computation, (2) the varying importance of tokens, and (3) the diverse dynamic sparsity patterns across attention heads. These levels of differentiation introduce irregular memory usage patterns across requests and attention heads, posing significant scalability challenges for memory management. To address these challenges, DiffKV proposes an on-GPU memory manager that compacts fragmented free memory lists into contiguous regions in parallel, effectively translating sparsity in the KV cache into performance gains. We evaluate DiffKV on several mainstream LLMs, including emerging thinking models that generate extended chains of thought. DiffKV compresses the KV cache by 2.7× to 5.7× with near-lossless accuracy on complex workloads requiring sophisticated reasoning and long-generation capabilities, and improves throughput by 1.9× to 5.4×. The source code of DiffKV is available at https://github.com/zyqCSL/DiffKV.
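
To make the idea of compacting a fragmented free list into a contiguous region more concrete, the following is a minimal sketch of generic prefix-sum (stream) compaction. It is not taken from the DiffKV code base: the function name `compact_free_pages`, the page-flag representation, and the use of plain Python in place of GPU threads are all assumptions made for illustration. The point it shows is that each free page's destination slot can be computed independently from an exclusive prefix sum, which is what makes this kind of compaction amenable to parallel execution.

```python
from itertools import accumulate


def compact_free_pages(is_free):
    """Pack the IDs of free pages into a contiguous list.

    Hypothetical illustration only: is_free[i] is True when page i is free.
    Each free page's output slot equals the number of free pages before it
    (an exclusive prefix sum), so every write is independent of the others.
    """
    flags = [1 if f else 0 for f in is_free]
    # Exclusive prefix sum: slots[i] = count of free pages with index < i.
    slots = [0] + list(accumulate(flags))[:-1] if flags else []
    compacted = [None] * sum(flags)
    # Scatter step: on a GPU, each iteration could be handled by one thread.
    for page_id, (flag, slot) in enumerate(zip(flags, slots)):
        if flag:
            compacted[slot] = page_id
    return compacted


# Pages 1, 2, 4, and 7 are free; the rest are in use.
free_list = compact_free_pages(
    [False, True, True, False, True, False, False, True]
)
print(free_list)  # [1, 2, 4, 7]
```

In this sketch the prefix sum is computed sequentially; on a GPU it would typically be replaced by a parallel scan, with the scatter loop mapped to one thread per page.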

Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen • 2024

Related benchmarks

Task | Dataset | Result | Rank
Long-context Text Summarization | MultiNews 128K context | ROUGE-L: 28.7 | 18
Long-context Question Answering | Qasper 128K context | F1 Score: 38 | 18
Long-context retrieval and reasoning | RULER 128K context | Accuracy: 62.5 | 18
Repository-level code completion | RepoBench-P 128K context | Score: 58.5 | 18
Long-context Understanding | L-Eval 32K | P95 Latency (ms): 338 | 12
Summarization | GovReport 16K | 95th Percentile Latency (ms): 185 | 12