Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

About

While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.

Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu • 2025
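
The abstract's idea of treating few-shot examples as indivisible Semantic Units can be illustrated with a minimal sketch of shot-level cache selection during prefill. This is not the authors' implementation: the function name `select_shots`, the use of a mean per-token importance score, and the greedy budget fill are all assumptions made for illustration, and the separate decoding-phase policy mentioned in the abstract is not modeled here.

```python
from typing import List, Tuple


def select_shots(
    token_scores: List[float],
    shot_spans: List[Tuple[int, int]],
    budget: int,
) -> List[int]:
    """Return sorted token indices to keep, retaining or dropping whole shots.

    token_scores: hypothetical per-token importance (e.g. aggregated attention).
    shot_spans:   half-open (start, end) token ranges, one per few-shot example.
    budget:       maximum number of prefill tokens to keep in the cache.
    """

    def mean_score(span: Tuple[int, int]) -> float:
        start, end = span
        # Use the mean so that long shots are not favoured by length alone.
        return sum(token_scores[start:end]) / max(1, end - start)

    # Rank shots from most to least important (assumed scoring rule).
    ranked = sorted(shot_spans, key=mean_score, reverse=True)

    kept: List[int] = []
    for start, end in ranked:
        # A shot is kept only if it fits entirely; it is never split.
        if len(kept) + (end - start) <= budget:
            kept.extend(range(start, end))
    return sorted(kept)


# Four two-token shots, with room for two of them.
scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.3, 0.7, 0.7]
spans = [(0, 2), (2, 4), (4, 6), (6, 8)]
print(select_shots(scores, spans, budget=4))  # [2, 3, 6, 7]
```

A token-level policy with the same budget could keep fragments of several shots; the sketch instead keeps the two highest-scoring shots intact, which is the property the abstract argues matters for reasoning tasks.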

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Arithmetic Reasoning | GSM8K | Accuracy | 81.07 | 272 |
| Mathematical Reasoning | LG-GSM8K | Accuracy (LG-GSM8K) | 47.33 | 17 |
| Document Question Answering | HotpotQA | Accuracy | 44.38 | 13 |
| Arithmetic Reasoning | KVFundaBench AR | -- | -- | 5 |
| World Knowledge | KVFundaBench WK | -- | -- | 5 |
| Efficiency Evaluation | Efficiency Benchmark A40 4096 4096 | Latency (s) | 162.8 | 4 |
| LLM Inference Efficiency | Synthetic LLM Workload Input 4096 Output 4096 | Latency (s) | 162.8 | 4 |
| LLM Inference Efficiency | Synthetic LLM Workload Input 8192 Output 4096 | Latency (s) | 162.8 | 2 |