Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
About
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigating the KV cache bottleneck, it typically underperforms within-layer methods such as GQA. To understand the root cause, we investigate the information flow into the keys and values of the top layers. Our preliminary analysis reveals a clear division: values are predominantly derived from the bottom layer, while keys draw information from both the bottom and middle layers. Building on this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative caches from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are derived directly from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed methods reduce KV cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing them as a memory-efficient, high-performance architectural alternative.
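The core idea can be sketched in a few lines. Below is a minimal, hypothetical illustration (not the authors' implementation): top-layer keys and values are formed as a learnable weighted combination of caches from the bottom and middle layers, with keys fused post-RoPE. The function name `fuse` and the scalar weights are assumptions for illustration; the actual method learns the fusion parameters during training.

```python
def fuse(bottom, middle, w_b, w_m):
    """Element-wise weighted fusion of two same-length cache vectors.

    In FusedKV terms, `bottom` and `middle` stand in for cached keys or
    values from the bottom and middle layers; w_b and w_m play the role
    of the learnable fusion weights.
    """
    return [w_b * b + w_m * m for b, m in zip(bottom, middle)]

# Keys draw information from both bottom and middle layers,
# so both sources get meaningful weight (illustrative values).
k_bottom = [0.2, -0.5, 1.0]   # post-RoPE keys cached at the bottom layer
k_middle = [0.4,  0.1, -0.3]  # post-RoPE keys cached at the middle layer
fused_k = fuse(k_bottom, k_middle, w_b=0.5, w_m=0.5)

# Values are predominantly derived from the bottom layer,
# so the bottom-layer weight dominates (illustrative values).
v_bottom = [1.0, 0.0, -1.0]
v_middle = [0.2, 0.3,  0.4]
fused_v = fuse(v_bottom, v_middle, w_b=0.9, w_m=0.1)
```

Because the keys are fused after RoPE has already been applied, the combination preserves relative positional information without a second pass of rotary embeddings; FusedKV-Lite drops the fusion entirely and reuses the bottom-layer values and middle-layer keys as-is.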
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText | PPL 8.94 | 479 |
| Multi-task Language Understanding | LM Evaluation Harness (test) | ARC Challenge Acc 44.2 | 24 |
| Language Modeling | FineWeb-Edu 500M-token (val) | Valid Loss 2.221 | 18 |
| Language Modeling | WikiText v1 (test) | Perplexity 13.33 | 18 |
| Long-context Language Understanding | RULER 128k | Average Score 42.31 | 4 |
| Downstream Task Evaluation | MNLI, SCIQ, LAMBADA, HellaSwag, ARC, MMLU | MNLI Acc 0.3852 | 2 |