xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
About
Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key-value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either require expensive pretraining or rely on per-token cross-layer cosine similarity that is often limited in practice. We show, via Centered Kernel Alignment (CKA), that the dominant singular vectors of KV-Cache are well aligned across layers. Motivated by this observation, we propose xKV, a post-training compression method that jointly factorizes grouped-layer KV-Cache into a shared low-rank subspace, substantially reducing KV-Cache memory. Across widely used LLMs, xKV achieves up to 8x KV-Cache compression while preserving accuracy on long-context tasks and in multi-turn settings. To further improve efficiency, we introduce Selective Reconstruction (SR) at decode time. Combined with SR, xKV achieves up to 4.23x end-to-end speedup over the full attention baseline, and surpasses notable baselines with 30% higher throughput under a similar accuracy level. Overall, xKV provides a plug-and-play approach to reduce both memory and latency for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
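The joint factorization described above can be illustrated with a minimal NumPy sketch. It is not the released xKV code: it assumes each layer's cache is a dense `(num_tokens, hidden_dim)` matrix, that the grouped layers are concatenated along the feature axis, and that a plain truncated SVD is used. The names `joint_low_rank` and `reconstruct` are hypothetical and used only for illustration.

```python
import numpy as np


def joint_low_rank(kv_layers, rank):
    """Factorize a group of per-layer cache matrices into one shared basis.

    kv_layers: list of arrays, each of shape (num_tokens, hidden_dim).
    Returns (shared, per_layer): `shared` has shape (num_tokens, rank) and is
    stored once for the whole group; `per_layer` is a list of small
    (rank, hidden_dim) matrices, one per layer.
    NOTE: illustrative assumption, not the authors' implementation.
    """
    hidden_dim = kv_layers[0].shape[1]
    # Concatenate along the feature axis: (num_tokens, group_size * hidden_dim).
    stacked = np.concatenate(kv_layers, axis=1)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    # Keep the top-`rank` singular directions; fold singular values into the basis.
    shared = u[:, :rank] * s[:rank]
    vt_r = vt[:rank]
    # Slice the right factor back into one block per layer.
    per_layer = [
        vt_r[:, i * hidden_dim:(i + 1) * hidden_dim]
        for i in range(len(kv_layers))
    ]
    return shared, per_layer


def reconstruct(shared, per_layer, layer_idx):
    """Rebuild the approximate cache of a single layer from the shared basis."""
    return shared @ per_layer[layer_idx]


# Synthetic group of 4 layers whose caches share an 8-dimensional token subspace.
rng = np.random.default_rng(0)
token_basis = rng.standard_normal((64, 8))
layers = [token_basis @ rng.standard_normal((8, 16)) for _ in range(4)]

shared, per_layer = joint_low_rank(layers, rank=8)
original_size = sum(x.size for x in layers)
compressed_size = shared.size + sum(b.size for b in per_layer)
print("compression ratio:", original_size / compressed_size)
```

In this toy setup the layers share a subspace exactly, so the rank-8 reconstruction is lossless up to floating-point error; real caches are only approximately aligned, so the chosen rank trades memory against accuracy.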
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 61.9 | 1398 |
| Multi-task Language Understanding | MMLU | Accuracy | 63.9 | 353 |
| Long-context language modeling | LongBench | Average Score | 42.69 | 328 |
| Document Question Answering | Qasper | Accuracy | 35.6 | 44 |
| Long-context evaluation | RULER 64k | VT Score | 86.67 | 43 |
| Key-Value Retrieval | LITM (Lost in the Middle) | Accuracy | 99.9 | 33 |
| Variable Tracking | RULER-VT | Accuracy | 99.8 | 33 |
| Long-context Language Understanding | RULER 64k context length | FWE (Error) | 78.47 | 22 |
| Long-context evaluation | LongBench (test) | NarQA Score | 32.85 | 18 |
| Long-context Language Understanding | LongBench 1 host v1 (test) | 2WQA Score | 39.53 | 14 |