VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

About

The growing context length of Large Language Models (LLMs) inflates the Key-Value (KV) cache, hindering deployment in resource-constrained environments. Prior training-free approaches to KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method that introduces vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing thousands of floating-point values to be represented by just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation within the same memory footprint.
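
The core idea in the abstract, representing many floating-point values with a few integer indices, can be illustrated with a small product-style vector-quantization sketch in NumPy. Everything below is an illustrative assumption rather than VQKV's actual pipeline: the k-means codebook construction, the 256-entry codebook, the length-8 sub-vectors, and the toy cache shapes.

```python
import numpy as np

def nearest(vectors, codebook):
    # argmin_c ||x - c||^2; the ||x||^2 term is constant per vector,
    # so ||c||^2 - 2 x.c suffices for the argmin.
    scores = (codebook ** 2).sum(axis=1)[None, :] - 2.0 * vectors @ codebook.T
    return scores.argmin(axis=1)

def build_codebook(vectors, num_codes=256, iters=10, seed=0):
    """Fit a codebook to the sub-vectors with plain k-means (Lloyd's).
    This is an illustrative assumption, not VQKV's codebook method."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), num_codes, replace=False)].copy()
    for _ in range(iters):
        assign = nearest(vectors, codebook)
        for c in range(num_codes):
            members = vectors[assign == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook

def quantize(vectors, codebook):
    """Store one uint8 index per float sub-vector (256 codes -> 1 byte)."""
    return nearest(vectors, codebook).astype(np.uint8)

def dequantize(indices, codebook):
    """Look the indices back up to get approximate float sub-vectors."""
    return codebook[indices]

# Toy KV cache: 2048 cached key vectors with head dimension 128, split
# into sub-vectors of length 8 before quantization (product-VQ style).
kv = np.random.randn(2048, 128).astype(np.float32)
sub = kv.reshape(-1, 8)                       # (32768, 8) sub-vectors
codebook = build_codebook(sub)                # (256, 8) float codebook
idx = quantize(sub, codebook)                 # (32768,) uint8 indices
recon = dequantize(idx, codebook).reshape(kv.shape)

orig_bytes = kv.nbytes                        # 32-bit floats
comp_bytes = idx.nbytes + codebook.nbytes     # indices + shared codebook
print(f"toy compression ratio: {1 - comp_bytes / orig_bytes:.1%}")
print(f"mean abs reconstruction error: {np.abs(kv - recon).mean():.4f}")
```

Because all sub-vectors share one small codebook, storage is dominated by the one-byte indices, which is what makes high compression ratios attainable; fidelity then depends on how well the codebook covers the distribution of cached keys and values.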

Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin • 2026

Related benchmarks

Task                         Dataset     Result                          Rank
Long-context Understanding   LongBench   HotpotQA: 11.8                  82
Long-context Evaluation      RULER       Accuracy (Context 4k): 92.88    34
