Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
About
While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.
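The idea of treating few-shot examples as indivisible Semantic Units can be illustrated with a small sketch. This is not the authors' ShotKV implementation: the function name `select_whole_shots`, the mean-score ranking rule, and the greedy budget fill are all assumptions made for illustration. It only shows what it means to keep or evict each shot as a whole during prefill, rather than dropping individual tokens from inside an example.

```python
from typing import List, Tuple


def select_whole_shots(
    token_scores: List[float],
    shot_spans: List[Tuple[int, int]],
    budget: int,
) -> List[int]:
    """Return sorted token indices to keep, choosing entire shots only.

    token_scores: hypothetical per-token importance (e.g. aggregated attention).
    shot_spans:   half-open (start, end) token ranges, one per few-shot example.
    budget:       maximum number of prompt tokens to retain in the cache.
    """
    # Rank shots by the mean importance of their tokens (an assumed scoring rule).
    ranked = sorted(
        shot_spans,
        key=lambda span: sum(token_scores[span[0]:span[1]]) / (span[1] - span[0]),
        reverse=True,
    )

    kept: List[int] = []
    used = 0
    for start, end in ranked:
        length = end - start
        # A shot is either kept in full or evicted in full, never split.
        if used + length <= budget:
            kept.extend(range(start, end))
            used += length
    return sorted(kept)


# Toy prompt: four shots of 2, 3, 2, and 2 tokens.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.05, 0.05, 0.6, 0.5]
spans = [(0, 2), (2, 5), (5, 7), (7, 9)]
kept = select_whole_shots(scores, spans, budget=5)
print(kept)
```

With a budget of five tokens, the two highest-scoring shots fit exactly, so every retained example stays intact. A token-level top-k policy with the same budget could instead keep fragments of several shots, which is the kind of broken reasoning chain the abstract warns about.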
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Arithmetic Reasoning | GSM8K | Accuracy: 81.07 | 272 |
| Mathematical Reasoning | LG-GSM8K | Accuracy (LG-GSM8K): 47.33 | 17 |
| Document Question Answering | HotpotQA | Accuracy: 44.38 | 13 |
| Arithmetic Reasoning | KVFundaBench AR | -- | 5 |
| World Knowledge | KVFundaBench WK | -- | 5 |
| Efficiency Evaluation | Efficiency Benchmark A40 4096 4096 | Latency (s): 162.8 | 4 |
| LLM Inference Efficiency | Synthetic LLM Workload Input 4096 Output 4096 | Latency (s): 162.8 | 4 |
| LLM Inference Efficiency | Synthetic LLM Workload Input 8192 Output 4096 | Latency (s): 162.8 | 2 |