
OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

About

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation, followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal, is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm by jointly quantizing triplets of rotated coordinates. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates, together with the triplet norm, are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation that depends only on the total dimensionality of the keys. Sweeps show that the finite-dimensional quality optimum is constant across every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as the bit width drops toward extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency beyond the existing dequantization. Project Page: https://octopus-quant.github.io/
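The triplet-to-square step can be illustrated with a minimal sketch. It assumes the standard octahedral map used for unit-vector encoding in graphics; the abstract does not say which variant OCTOPUS uses, and the function names `oct_encode` and `oct_decode` are hypothetical. The random rotation and the Lloyd-Max quantization of the three outputs are deliberately left out.

```python
import math


def _sign_not_zero(x):
    # Sign that treats 0 as positive, so the fold below never collapses to 0.
    return 1.0 if x >= 0.0 else -1.0


def oct_encode(x, y, z):
    """Map a 3-vector to (u, v, r).

    (u, v) lies in the square [-1, 1]^2 and encodes the direction;
    r is the Euclidean norm of the triplet.
    """
    r = math.sqrt(x * x + y * y + z * z)
    l1 = abs(x) + abs(y) + abs(z)
    if l1 == 0.0:
        return 0.0, 0.0, 0.0  # zero triplet: direction is arbitrary
    # Project onto the octahedron |x| + |y| + |z| = 1.
    px, py, pz = x / l1, y / l1, z / l1
    if pz < 0.0:
        # Fold the lower half of the octahedron out to the corners of the square.
        px, py = ((1.0 - abs(py)) * _sign_not_zero(px),
                  (1.0 - abs(px)) * _sign_not_zero(py))
    return px, py, r


def oct_decode(u, v, r):
    """Invert oct_encode: recover the 3-vector from (u, v, r)."""
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        # Undo the fold for points that came from the lower half.
        u, v = ((1.0 - abs(v)) * _sign_not_zero(u),
                (1.0 - abs(u)) * _sign_not_zero(v))
    n = math.sqrt(u * u + v * v + z * z)  # never 0: the point has L1 norm 1
    return r * u / n, r * v / n, r * z / n
```

In a codec along these lines, `u`, `v`, and `r` would each be passed through a scalar quantizer before storage, and `oct_decode` would be applied to the dequantized values.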

Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText-2 | Perplexity (PPL) | 10.306 | 2320 |
| Video Generation | CausVid | LPIPS | 0.038 | 30 |
| Language Modeling | C4 | Perplexity | 12.893 | 16 |
| Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 4k context length | Recall | 100 | 16 |
| Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 8k context length | Needle-in-a-Haystack Recall (8k Context) | 100 | 16 |
| Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 16k context length | Recall | 100 | 16 |
| Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 32k context length | Recall | 100 | 16 |
| Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 64k context length | Recall | 100 | 16 |
| Multi-key needle-in-a-haystack recall | Multi-key needle-in-a-haystack 128k context length | Recall | 100 | 16 |
| KV cache reconstruction | Isotropic Gaussian keys/queries d=128 averaged over 64 seeds (synthetic) | Cosine Similarity | 0.9965 | 15 |
Showing 10 of 13 rows
