VQKV: High-Fidelity and High-Ratio KV Cache Compression via Vector Quantization
About
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, hindering deployment in resource-limited environments. Prior training-free approaches to KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to achieve high compression ratios and high reconstruction fidelity simultaneously. We propose VQKV, a novel training-free method that introduces vector quantization (VQ) to obtain highly compressed KV representations while preserving model fidelity, representing thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench, and enables 4.3x longer generation at the same memory footprint.
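To make the core idea concrete, below is a minimal NumPy sketch of VQ-based KV cache compression. It assumes a k-means codebook built from calibration KV sub-vectors and an 8-dimensional sub-vector grouping; the codebook construction, group size, and all shapes here are illustrative assumptions, not VQKV's actual design.

```python
# Minimal sketch: compress KV cache sub-vectors to integer codebook indices.
# Codebook construction (plain k-means) and group size are assumptions.
import numpy as np

def build_codebook(samples: np.ndarray, num_codes: int = 256, iters: int = 10) -> np.ndarray:
    """Plain k-means codebook over calibration KV sub-vectors (no model training)."""
    rng = np.random.default_rng(0)
    codebook = samples[rng.choice(len(samples), num_codes, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest codeword (squared Euclidean distance).
        d = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(num_codes):
            members = samples[assign == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook

def compress(kv: np.ndarray, codebook: np.ndarray, group: int = 8) -> np.ndarray:
    """Replace each `group`-dim sub-vector of the KV cache with one uint8 index."""
    sub = kv.reshape(-1, group)                                   # (num_subvectors, group)
    d = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # distances to codewords
    return d.argmin(1).astype(np.uint8)                           # indices into the codebook

def decompress(idx: np.ndarray, codebook: np.ndarray, shape: tuple) -> np.ndarray:
    """Reconstruct an approximate KV cache by codebook lookup."""
    return codebook[idx].reshape(shape)

# Toy usage: each group of 8 fp32 values (32 bytes) becomes one uint8 index,
# i.e. ~32x smaller per group, excluding the small shared codebook.
kv = np.random.randn(64, 128).astype(np.float32)   # hypothetical (tokens, head_dim)
cb = build_codebook(kv.reshape(-1, 8))
idx = compress(kv, cb)
kv_hat = decompress(idx, cb, kv.shape)
```

In a deployed cache, only the integer indices are stored per token while the codebook is shared across the sequence, which is how many floating-point values collapse into a few indices.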
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long-context understanding | LongBench | HotpotQA: 11.8 | 82 |
| Long-context evaluation | RULER | Accuracy (Context 4k): 92.88 | 34 |