PolarQuant: Quantizing KV Caches with Polar Transformation

About

Large language models (LLMs) require significant memory to store Key-Value (KV) embeddings in their KV cache, especially when handling long contexts. Quantizing these KV embeddings is a common technique for reducing memory consumption. This work introduces PolarQuant, a novel quantization method that employs random preconditioning and a polar transformation. Our method transforms the KV embeddings into polar coordinates using an efficient recursive algorithm and then quantizes the resulting angles. Our key insight is that, after random preconditioning, the angles in the polar representation exhibit a tightly bounded and highly concentrated distribution with an analytically computable form. This well-behaved distribution eliminates the need for explicit normalization, a step that traditional quantization methods require and that introduces significant memory overhead, because the quantization parameters (e.g., zero point and scale) must be stored in full precision for each data block. PolarQuant bypasses this normalization step, enabling substantial memory savings. The long-context evaluation demonstrates that PolarQuant compresses the KV cache by more than 4.2× while achieving the best quality scores among state-of-the-art methods.
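The pipeline described in the abstract (random preconditioning, recursive polar transformation, angle quantization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`random_rotation`, `to_polar`, `from_polar`, `quantize_angles`, `dequantize_angles`) are hypothetical, the preconditioner is assumed to be a random orthogonal matrix, and the angles are quantized on a fixed uniform grid, which is a simplification; the paper's actual preconditioner and codebook may differ. The point of the sketch is that the angle ranges are known in advance, so no per-block scale or zero point needs to be stored.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Random orthogonal matrix used as an illustrative preconditioner (assumption)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs; result stays orthogonal

def to_polar(x):
    """Recursively pair up coordinates of a length-2^k vector.

    Returns the final radius (the vector norm) and one angle array per level.
    """
    r = np.asarray(x, dtype=np.float64)
    angles = []
    while r.size > 1:
        a, b = r[0::2], r[1::2]
        angles.append(np.arctan2(b, a))  # level 0: [-pi, pi]; later levels: [0, pi/2]
        r = np.hypot(a, b)               # radii of the pairs feed the next level
    return r[0], angles

def from_polar(radius, angles):
    """Invert to_polar by expanding radii level by level."""
    r = np.array([radius], dtype=np.float64)
    for theta in reversed(angles):
        out = np.empty(2 * r.size)
        out[0::2] = r * np.cos(theta)
        out[1::2] = r * np.sin(theta)
        r = out
    return r

def _angle_range(level):
    # Signed inputs at level 0; non-negative radii at every later level.
    return (-np.pi, np.pi) if level == 0 else (0.0, np.pi / 2)

def quantize_angles(angles, bits=4):
    """Uniform quantization on a fixed grid (simplifying assumption, no stored scale)."""
    n = 2 ** bits
    codes = []
    for level, theta in enumerate(angles):
        lo, hi = _angle_range(level)
        idx = np.floor((theta - lo) / (hi - lo) * n)
        codes.append(np.clip(idx, 0, n - 1).astype(np.uint8))
    return codes

def dequantize_angles(codes, bits=4):
    """Map each code back to the midpoint of its bin."""
    n = 2 ** bits
    out = []
    for level, idx in enumerate(codes):
        lo, hi = _angle_range(level)
        out.append(lo + (idx + 0.5) * (hi - lo) / n)
    return out

if __name__ == "__main__":
    d = 16
    rng = np.random.default_rng(1)
    x = rng.standard_normal(d)           # stand-in for one key or value embedding
    Q = random_rotation(d)
    radius, angles = to_polar(Q @ x)     # precondition, then transform
    codes = quantize_angles(angles, bits=4)
    x_hat = Q.T @ from_polar(radius, dequantize_angles(codes, bits=4))
    print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

In this sketch only the radius is kept in full precision per vector; every angle is stored as a small integer code against a grid that is fixed ahead of time. A concentrated angle distribution, as the abstract describes, would allow a tighter grid than the full ranges used here.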

Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, Amir Zandieh • 2025

Related benchmarks

Task | Dataset | Result | Rank
Language Modeling | WikiText-2 | Perplexity (PPL): 10.473 | 2320
Video Generation | CausVid | LPIPS: 0.0369 | 30
Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 4k context length | Recall: 100 | 16
Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 8k context length | Needle-in-a-Haystack Recall (8k Context): 100 | 16
Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 16k context length | Recall: 100 | 16
Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 32k context length | Recall: 100 | 16
Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 64k context length | Recall: 100 | 16
Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 128k context length | Recall: 100 | 16
Language Modeling | C4 | Perplexity: 13.091 | 16
Autoregressive audio (AAR) | AudioSet 20k (subset of 100 random 10 s clips) | Compression Ratio: 2.75 | 15
Showing 10 of 13 rows
