GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

About

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.

Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai • 2026
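
The abstract describes merging as a ridge-regression step that pulls the compressed-cache attention output toward the full-cache output while regularizing the update. The NumPy sketch below illustrates that general idea for a single attention head; it is not the authors' implementation. The function names (`ridge_merge_values`, `attention_error`), the choice to update only the retained value vectors, the regularization weight `lam`, and the use of a small batch of probe queries `Q` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_error(Q, K, V, keep, V_keep):
    """Frobenius distance between compressed-cache and full-cache outputs."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    full_out = softmax(Q @ K.T * scale) @ V
    comp_out = softmax(Q @ K[keep].T * scale) @ V_keep
    return np.linalg.norm(comp_out - full_out)


def ridge_merge_values(Q, K, V, keep, lam=1e-2):
    """Hypothetical helper: adjust retained values by ridge regression.

    Solves  min_D ||A (V_keep + D) - O||_F^2 + lam * ||D||_F^2,
    where O is the full-cache attention output and A holds the attention
    weights over the retained keys only. The lam term penalizes large
    updates, so retained values stay close to their originals.
    """
    scale = 1.0 / np.sqrt(Q.shape[-1])
    full_out = softmax(Q @ K.T * scale) @ V        # target O, shape (m, d_v)
    A = softmax(Q @ K[keep].T * scale)             # weights, shape (m, k)
    residual = full_out - A @ V[keep]              # what eviction lost
    gram = A.T @ A + lam * np.eye(len(keep))       # (k, k), positive definite
    delta = np.linalg.solve(gram, A.T @ residual)  # closed-form ridge update
    return V[keep] + delta


# Toy usage with random tensors (assumed shapes: 64 cached tokens, 16 kept).
rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 16))
K = rng.standard_normal((64, 16))
V = rng.standard_normal((64, 16))
keep = np.arange(0, 64, 4)

V_merged = ridge_merge_values(Q, K, V, keep)
err_evict_only = attention_error(Q, K, V, keep, V[keep])
err_merged = attention_error(Q, K, V, keep, V_merged)
print(err_evict_only, err_merged)
```

Because the zero update is always a feasible solution of the regularized problem, the merged error on the probe queries cannot exceed the eviction-only error; a larger `lam` trades fidelity on those queries for smaller changes to the retained values.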

Related benchmarks

Task	Dataset	Result
Long-context Understanding	LongBench 1.0 (test)	NarrativeQA 26.51	108
Long-context Language Understanding	LongBench	NtrvQA 29.24	22
Long-context Language Understanding	RULER 16K 1.0 (test)	CWE Score 59.84	18
Long-context Language Understanding	RULER 16k	CWE Score 77.4	18

