Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
About
We propose \textbf{Hurwitz Quaternion Multiplicative Quantization (HQMQ)}, a \textbf{calibration-free} method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the \emph{product} $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, nearest-neighbor angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ \emph{random} unit quaternions. The multiplicative composition yields $24S$ effective codewords while storing only $S$ quaternions; random initialization suffices because left-multiplication by a unit quaternion is an isometry of $S^3$, so codebooks drawn from different seeds vary in end-task perplexity (ppl) by $<1.5\%$. A per-batch median-multiplier outlier extraction step ($C{=}3$, no calibration), denoted Med3$\times$, handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B, Qwen2.5-7B, and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within $0.02$--$0.03$ ppl points at $\sim$5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4{+}$ ppl, HQMQ + Med3$\times$ recovers fp16 quality within $0.02$--$0.10$ ppl points at $\sim$5 bits. HQMQ Pareto-dominates naive integer quantization by $3$--$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at $3.79$ bits on Mistral-7B. Against the strongest calibrated KV-quantization baseline, HQMQ at $3.79$ bits matches KIVI-4 ($\sim 4.5$ bits) within ${\sim}1$ pt on CoQA, $0.6$ pts on TruthfulQA, and $2.3$ pts on GSM8K, while using $16\%$ fewer bits and no calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.
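The multiplicative codebook described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`hurwitz_units`, `qmul`, `build_codebook`, `quantize`, `dequantize`), the Gaussian-then-normalize sampling of $q_s$, and the choice to keep each chunk's norm in full precision are all assumptions made for the example, and the outlier extraction step is omitted.

```python
import itertools
import numpy as np


def hurwitz_units():
    """Return the 24 unit Hurwitz quaternions (group 2T) as a (24, 4) array."""
    units = []
    # 8 axis units: +-1, +-i, +-j, +-k
    for axis in range(4):
        for sign in (1.0, -1.0):
            q = np.zeros(4)
            q[axis] = sign
            units.append(q)
    # 16 half-integer units: (+-1 +-i +-j +-k) / 2
    for signs in itertools.product((0.5, -0.5), repeat=4):
        units.append(np.array(signs))
    return np.stack(units)


def qmul(a, b):
    """Hamilton product with broadcasting; the last axis holds (w, x, y, z)."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)


def build_codebook(S, seed=0):
    """Form all 24*S products q_p * q_s from S random unit quaternions q_s."""
    rng = np.random.default_rng(seed)
    qs = rng.standard_normal((S, 4))          # assumed sampling scheme
    qs /= np.linalg.norm(qs, axis=1, keepdims=True)
    qp = hurwitz_units()
    return qmul(qp[:, None, :], qs[None, :, :]).reshape(24 * S, 4)


def quantize(x, codebook):
    """Map each 4-element chunk to the index of its nearest codeword direction."""
    chunks = x.reshape(-1, 4)
    norms = np.linalg.norm(chunks, axis=1, keepdims=True)
    dirs = chunks / np.maximum(norms, 1e-12)
    # On the unit sphere, the nearest codeword maximizes the inner product.
    idx = np.argmax(dirs @ codebook.T, axis=1)
    return idx, norms  # norms kept in float here (an assumption)


def dequantize(idx, norms, codebook, shape):
    """Rebuild the tensor from codeword indices and per-chunk norms."""
    return (codebook[idx] * norms).reshape(shape)


# Toy usage on random data standing in for one head's K or V slice.
x = np.random.default_rng(1).standard_normal((16, 128))
cb = build_codebook(S=8)
idx, norms = quantize(x, cb)
x_hat = dequantize(idx, norms, cb, x.shape)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Because multiplying by a unit quaternion is an isometry, each $q_s$ contributes a rigidly rotated copy of the 24-cell, so every direction lies within $45^\circ$ of some codeword regardless of the seed; additional random $q_s$ only tighten that bound.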
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Retrieval | Needle-in-a-Haystack L=8k | Accuracy | 40 | 24 |
| Zero-shot Reasoning | HellaSwag zero-shot 200 items | Accuracy | 68 | 17 |
| Language Modeling | WikiText-103 | Bits Per Character (BPC) | 3.04 | 13 |
| Language Modeling | WikiText-103 50w x 2048 (test) | Perplexity | 8.83 | 12 |
| Long-context retrieval | RULER 4k | -- | -- | 12 |
| Language Modeling | WikiText-103 | Perplexity (PPL) | 5.32 | 10 |
| Language Modeling | WikiText-103 20w x 2048 | Perplexity (PPL) | 9.621 | 10 |
| Language Modeling | WikiText-103 | Perplexity (PPL) | 460.4 | 9 |
| Language Modeling | Mistral-7B Long-context (4k window) | Perplexity | 5.273 | 8 |
| Language Modeling | Mistral-7B Long-context (8k window) | Perplexity | 4.612 | 8 |