EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

About

The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank KV compression methods reduce this footprint by modifying model projections, limiting the flexibility to switch back to standard full-cache inference when sufficient memory is available. In this paper, we propose EchoKV, a flexible KV cache compression framework that supports on-demand transitions from full KV caching to compressed caching. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the discarded KV components from a partial subset, exploiting intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a lightweight two-stage fine-tuning strategy, requiring only a few minutes on a single A100 GPU for a 7B model. Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across multiple compression ratios and backbone models while preserving the throughput of full-cache inference in short-context scenarios.
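
The idea of reconstructing discarded KV components from a retained subset can be illustrated with a toy example. The sketch below is not the EchoKV network, whose architecture is not described on this page. It assumes a hypothetical stand-in: one discarded head is approximated from one retained head by a per-dimension scale fitted with least squares, on synthetic data built to be similar. All names (`fit_diagonal_map`, `reconstruct`, `kept_head`, `dropped_head`) are illustrative assumptions.

```python
import random


def fit_diagonal_map(kept, dropped):
    """Fit per-dimension scales w so that dropped[t][j] is close to w[j] * kept[t][j].

    This is a closed-form least-squares fit and a hypothetical stand-in for a
    learned reconstruction network; it is not the method from the paper.
    """
    dim = len(kept[0])
    weights = []
    for j in range(dim):
        num = sum(k[j] * d[j] for k, d in zip(kept, dropped))
        den = sum(k[j] * k[j] for k in kept)
        weights.append(num / den if den else 0.0)
    return weights


def reconstruct(kept, weights):
    """Rebuild the discarded head from the retained head and the fitted scales."""
    return [[w * x for w, x in zip(weights, row)] for row in kept]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two token-by-dim tables."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


random.seed(0)
tokens, dim = 64, 8

# Synthetic data (assumption): the discarded head is a scaled copy of the
# retained head plus small noise, mimicking strong similarity between heads.
true_scale = [random.uniform(0.5, 1.5) for _ in range(dim)]
kept_head = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
dropped_head = [
    [s * x + random.gauss(0, 0.01) for s, x in zip(true_scale, row)]
    for row in kept_head
]

# Only kept_head and the small weight vector would need to be stored;
# dropped_head is rebuilt on demand.
weights = fit_diagonal_map(kept_head, dropped_head)
approx = reconstruct(kept_head, weights)
error = max_abs_error(approx, dropped_head)
print(f"max reconstruction error: {error:.4f}")
```

In this toy setting only the retained head and `dim` scale values are stored, and the discarded head is rebuilt when needed. Because the model's own projections are untouched, falling back to full caching simply means keeping `dropped_head` instead of reconstructing it.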

Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che • 2026

Related benchmarks

Task	Dataset	Result	Rank
Long-context Language Understanding	LongBench	M-Avg 49.67	294
Long-context Understanding	RULER	VT Score 97.4	24

