Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

About

While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.

Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu • 2025
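
The abstract's idea of treating few-shot examples as indivisible Semantic Units can be illustrated with a small sketch. This is not the ShotKV algorithm itself, which the abstract does not specify. It is a hypothetical illustration that assumes each prompt token already has an importance score (for example, derived from attention weights) and that the token span of each shot is known. The function name `select_shots`, the mean-score ranking, and the greedy fill are all assumptions made for illustration.

```python
from typing import List, Tuple


def select_shots(
    token_scores: List[float],
    shot_spans: List[Tuple[int, int]],
    budget: int,
) -> List[int]:
    """Pick KV-cache token indices to keep, retaining whole shots only.

    token_scores: hypothetical per-token importance scores for the prompt.
    shot_spans:   (start, end) index pairs, end exclusive, one per few-shot
                  example; each span is assumed to be non-empty.
    budget:       maximum number of tokens that may be kept.
    """
    # Rank shots by mean token score (an assumed scoring rule).
    ranked = sorted(
        shot_spans,
        key=lambda span: sum(token_scores[span[0]:span[1]]) / (span[1] - span[0]),
        reverse=True,
    )

    kept: List[int] = []
    used = 0
    for start, end in ranked:
        length = end - start
        # A shot is either kept in full or dropped in full; it is never split.
        if used + length <= budget:
            kept.extend(range(start, end))
            used += length

    # Return indices in original prompt order.
    return sorted(kept)


scores = [0.1, 0.1, 0.9, 0.8, 0.7, 0.2, 0.3]
spans = [(0, 2), (2, 5), (5, 7)]
kept_indices = select_shots(scores, spans, budget=5)
```

With a budget of 5 tokens, the sketch keeps the second and third shots in full and drops the first, instead of keeping the five highest-scoring tokens regardless of shot boundaries. Token-level eviction would instead risk leaving partial examples in the cache.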

Related benchmarks

Task	Dataset	Result
Arithmetic Reasoning	GSM8K	Accuracy: 81.07	272
Mathematical Reasoning	LG-GSM8K	Accuracy (LG-GSM8K): 47.33	17
Document Question Answering	HotpotQA	Accuracy: 44.38	13
Arithmetic Reasoning	KVFundaBench AR	--	5
World Knowledge	KVFundaBench WK	--	5
Efficiency Evaluation	Efficiency Benchmark A40 4096 4096	Latency (s): 162.8	4
LLM Inference Efficiency	Synthetic LLM Workload Input 4096 Output 4096	Latency (s): 162.8	4
LLM Inference Efficiency	Synthetic LLM Workload Input 8192 Output 4096	Latency (s): 162.8	2
