Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
About
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigating the KV cache bottleneck, it typically underperforms within-layer methods such as GQA. To understand the root cause, we investigate the information flow into the keys and values of the top layers. Our preliminary analysis reveals a clear division: values are predominantly derived from the bottom layer, while keys draw information from both the bottom and middle layers. Building on this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative caches from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are derived directly from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed methods reduce KV cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing them as a memory-efficient, high-performance architectural alternative.
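The core idea can be sketched in a few lines. Below is a minimal, hypothetical illustration (not the authors' implementation): top-layer keys and values are formed as a learnable weighted combination of caches from the bottom and middle layers, with keys fused post-RoPE. The function name `fuse` and the scalar weights are assumptions for illustration; the actual method learns the fusion parameters during training.

```python
def fuse(bottom, middle, w_b, w_m):
    """Element-wise weighted fusion of two same-length cache vectors.

    In FusedKV terms, `bottom` and `middle` stand in for cached keys or
    values from the bottom and middle layers; w_b and w_m play the role
    of the learnable fusion weights.
    """
    return [w_b * b + w_m * m for b, m in zip(bottom, middle)]

# Keys draw information from both bottom and middle layers,
# so both sources get meaningful weight (illustrative values).
k_bottom = [0.2, -0.5, 1.0]   # post-RoPE keys cached at the bottom layer
k_middle = [0.4,  0.1, -0.3]  # post-RoPE keys cached at the middle layer
fused_k = fuse(k_bottom, k_middle, w_b=0.5, w_m=0.5)

# Values are predominantly derived from the bottom layer,
# so the bottom-layer weight dominates (illustrative values).
v_bottom = [1.0, 0.0, -1.0]
v_middle = [0.2, 0.3,  0.4]
fused_v = fuse(v_bottom, v_middle, w_b=0.9, w_m=0.1)
```

Because the keys are fused after RoPE has already been applied, the combination preserves relative positional information without a second pass of rotary embeddings; FusedKV-Lite drops the fusion entirely and reuses the bottom-layer values and middle-layer keys as-is.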
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText | PPL 8.94 | 479 |
| Multi-task Language Understanding | LM Evaluation Harness (test) | ARC Challenge Acc 44.2 | 24 |
| Language Modeling | FineWeb-Edu 500M-token (val) | Valid Loss 2.221 | 18 |
| Language Modeling | WikiText v1 (test) | Perplexity 13.33 | 18 |
| Long-context Language Understanding | RULER 128k | Average Score 42.31 | 4 |
| Downstream Task Evaluation | MNLI, SCIQ, LAMBADA, HellaSwag, ARC, MMLU | MNLI Acc 0.3852 | 2 |