Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

About

We propose \textbf{Hurwitz Quaternion Multiplicative Quantization (HQMQ)}, a \textbf{calibration-free} method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the \emph{product} $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, pairwise angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ \emph{random} unit quaternions. The multiplicative composition yields $24S$ effective codewords at $S$ stored parameters; random initialization suffices because left-multiplication is an $S^3$ isometry, so seeded codebooks vary in end-task perplexity (ppl) by $<1.5\%$. A per-batch median-multiplier outlier extraction step ($C{=}3$, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B, Qwen2.5-7B, and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within $0.02$--$0.03$ ppl points at $\sim$5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4{+}$ ppl, HQMQ with the median-multiplier step (Med3$\times$) recovers fp16 quality within $0.02$--$0.10$ ppl points at $\sim$5 bits. HQMQ Pareto-dominates naive integer quantization by $3$--$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at $3.79$ bits on Mistral. Against the strongest calibrated KV-quantization baseline, HQMQ at $3.79$ bits matches KIVI-4 ($\sim 4.5$ bits) within ${\sim}1$ pt on CoQA, $0.6$ pts on TruthfulQA, and $2.3$ pts on GSM8K, at $16\%$ fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.
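
The product codebook described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes quaternion components are ordered (w, x, y, z), that each chunk's norm is kept in full precision, and that the nearest codeword is chosen by maximum inner product. The function names and the choice S = 8 are hypothetical, and the sketch omits the outlier extraction step and any bit packing.

```python
import itertools
import numpy as np

def hurwitz_units():
    """Return the 24 unit Hurwitz quaternions (vertices of the 24-cell) as a (24, 4) array."""
    units = []
    # 8 axis-aligned units: +-1, +-i, +-j, +-k
    for axis in range(4):
        for sign in (1.0, -1.0):
            q = np.zeros(4)
            q[axis] = sign
            units.append(q)
    # 16 half-integer units: (+-1/2, +-1/2, +-1/2, +-1/2)
    for signs in itertools.product((0.5, -0.5), repeat=4):
        units.append(np.array(signs))
    return np.stack(units)

def qmul(a, b):
    """Hamilton product a * b, broadcasting over leading axes; components are (w, x, y, z)."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def build_codebook(num_secondary, seed=0):
    """Form all 24 * S products q_p * q_s from S random unit quaternions q_s."""
    rng = np.random.default_rng(seed)
    q_s = rng.standard_normal((num_secondary, 4))
    q_s /= np.linalg.norm(q_s, axis=1, keepdims=True)
    q_p = hurwitz_units()
    products = qmul(q_p[:, None, :], q_s[None, :, :])  # shape (24, S, 4)
    return products.reshape(-1, 4)  # row index = p * S + s

def quantize(x, codebook):
    """Split x into 4-element chunks; return the nearest codeword index and the norm per chunk."""
    chunks = x.reshape(-1, 4)
    norms = np.linalg.norm(chunks, axis=1)
    directions = chunks / np.maximum(norms, 1e-12)[:, None]
    # For unit vectors, the largest inner product is the smallest angle.
    idx = np.argmax(directions @ codebook.T, axis=1)
    return idx, norms

def dequantize(idx, norms, codebook, shape):
    """Rebuild an approximation of x from codeword indices and stored norms."""
    return (codebook[idx] * norms[:, None]).reshape(shape)

# Usage: quantize a stand-in for one head's K block (16 tokens, head dimension 128).
codebook = build_codebook(num_secondary=8, seed=0)  # 192 effective codewords
x = np.random.default_rng(1).standard_normal((16, 128))
idx, norms = quantize(x, codebook)
x_hat = dequantize(idx, norms, codebook, x.shape)
```

Only the S secondary quaternions need to be stored per (layer, head), since the 24 Hurwitz units are fixed. Multiplying a unit quaternion by another unit quaternion preserves length, so every product in the codebook is itself a unit quaternion.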

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba • 2026

Related benchmarks

Task	Dataset	Result
Retrieval	Needle-in-a-Haystack L=8k	Accuracy 40	24
Zero-shot Reasoning	HellaSwag zero-shot 200 items	Accuracy 68	17
Language Modeling	WikiText-103	Bits Per Character (BPC) 3.04	13
Language Modeling	WikiText-103 50w x 2048 (test)	Perplexity 8.83	12
Long-context retrieval	RULER 4k	--	12
Language Modeling	WikiText-103	Perplexity (PPL) 5.32	10
Language Modeling	WikiText-103 20w x 2048	Perplexity (PPL) 9.621	10
Language Modeling	WikiText-103	Perplexity (PPL) 460.4	9
Language Modeling	Mistral-7B Long-context (4k window)	Perplexity 5.273	8
Language Modeling	Mistral-7B Long-context (8k window)	Perplexity 4.612	8

Showing 10 of 17 rows
