OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
About
Large language models (LLMs) with extended context windows enable powerful applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy. Code is available at https://github.com/DreamSoul-AI/OBCache.
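To make the contrast between attention-weight ranking and output-aware ranking concrete, here is a minimal sketch for a single query and a single attention head. It is not OBCache's closed-form OBD score, which this abstract does not spell out. It computes the exact change in the attention output when one key-value pair is evicted and the remaining softmax weights are renormalized, which works out to `a_i / (1 - a_i) * (o - v_i)`. The function names and toy numbers are assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of attention logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(weights, values):
    # Weighted sum of value vectors: o = sum_j a_j * v_j.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def attention_weight_scores(logits):
    # Heuristic baseline: a token's saliency is its attention weight alone.
    return softmax(logits)

def output_aware_scores(logits, values):
    # Squared L2 change in the attention output when token i is evicted.
    # With renormalized weights, o' - o = a_i / (1 - a_i) * (o - v_i), so the
    # score depends on the weight a_i, the value v_i, and the output o.
    # Assumes at least two cached tokens, so that a_i < 1.
    a = softmax(logits)
    o = attention_output(a, values)
    scores = []
    for a_i, v_i in zip(a, values):
        gap = sum((o_d - v_d) ** 2 for o_d, v_d in zip(o, v_i))
        scores.append((a_i / (1.0 - a_i)) ** 2 * gap)
    return scores

# Toy cache of four tokens (hypothetical numbers).
logits = [2.0, 1.0, 0.0, -1.0]
values = [[1.0, 0.0], [0.5, 0.5], [-3.0, 4.0], [2.0, 2.0]]

print("attention-weight scores:", attention_weight_scores(logits))
print("output-aware scores:   ", output_aware_scores(logits, values))
```

A token with a large attention weight can still receive a low output-aware score when its value vector lies close to the current output, because evicting it barely moves the result. A real eviction policy would also aggregate such scores across query positions and heads, which this sketch leaves out.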
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context language modeling | RULER | RULER Score | 0.3492 | 204 |
| Long-context evaluation | LongBench | Average Score | 40.49 | 90 |
| Long-context evaluation | RULER | Average Accuracy Score | 25.99 | 54 |
| Long-context Understanding | LongBench | Average Score | 45.92 | 38 |
| Long-context language modeling | RULER 4k | Accuracy | 82 | 29 |
| Long-context Understanding | RULER | Average Accuracy | 71.53 | 27 |
| Long-context language modeling evaluation | RULER 32k | Average Score (RULER 32K) | 88.5 | 12 |
| Long-context language modeling | LongBench | LongBench Average Score | 44.23 | 12 |
| Long-context Language Understanding | LongBench avg | LongBench Avg Score | 41.52 | 5 |
| Long-context Language Understanding | Ruler (Avg.) | Ruler Avg Score | 58.47 | 5 |