TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

About

Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.

Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni• 2025

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-2	Perplexity (PPL)10.34	2862
Language Modeling	WikiText-103 (test)	Perplexity9.22	773
Code Generation	HumanEval	pass@163.42	329
Code Generation	MBPP	Pass@159.8	211
Perplexity	WikiText-2	Perplexity352.3	113
KV Cache Quantization	WikiText-2 prompt 2048-token (train)	Softmax KL Divergence0.0063	99
Long-context Language Understanding	LongBench-e	Average Score43.2	93
Top-10 Attended-Token Overlap	Attention Diagnostics	Top-10 Attended Overlap91.2	90
Mathematical Reasoning	Minerva MATH500	Score21.2	36
Mathematical Reasoning	GSM8K	Flexible Extract Score48.9	35

Showing 10 of 52 rows

Other info

Follow for update

@wizwand_team Discord