Runtime-Certified Bounded-Error Quantized Attention

About

KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA 3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.
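
The two-term decomposition and fallback described above can be illustrated with a minimal single-head, single-step sketch. This is not the paper's implementation: the function names, the symmetric per-vector quantizer, the specific bound formulas, and the single-stage fallback are all assumptions chosen for illustration, and Python floats stand in for the FP16 originals. The sketch uses the identity o' - o = Σ (p'_i - p_i) v'_i + Σ p_i (v'_i - v_i), where primes denote quantized quantities, so both terms can be bounded from the quantized data and its scales alone.

```python
import math
import random

def quantize(vec, bits):
    """Symmetric per-vector quantization (an assumed scheme).
    Returns the dequantized vector and a bound on its per-element error."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in vec)
    scale = amax / qmax if amax > 0 else 1.0
    deq = [round(x / scale) * scale for x in vec]
    return deq, scale / 2  # rounding error is at most half a step

def attend(q, keys, values):
    """Plain softmax attention for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return [sum(p * v[j] for p, v in zip(probs, values))
            for j in range(len(values[0]))]

def certified_attend(q, keys, values, tol, key_bits=8, value_bits=4):
    """Return (output, bound, path). Uses the quantized cache when the
    computed max-abs error bound is within tol, else the exact originals."""
    d = len(q)
    kq = [quantize(k, key_bits) for k in keys]
    vq = [quantize(v, value_bits) for v in values]
    k_hat = [k for k, _ in kq]
    v_hat = [v for v, _ in vq]

    # Term (i): key quantization perturbs each logit by at most eps, which
    # changes the softmax distribution by at most exp(2*eps) - 1 in L1 norm
    # (and never by more than 2).
    eps = sum(abs(x) for x in q) * max(e for _, e in kq) / math.sqrt(d)
    dist_l1 = min(2.0, math.expm1(2 * eps))
    key_term = dist_l1 * max(abs(x) for v in v_hat for x in v)

    # Term (ii): value reconstruction error, averaged under a probability
    # distribution, is at most the worst per-element value error.
    value_term = max(e for _, e in vq)

    bound = key_term + value_term
    if bound <= tol:
        return attend(q, k_hat, v_hat), bound, "quantized"
    return attend(q, keys, values), 0.0, "fallback"

if __name__ == "__main__":
    random.seed(0)
    q = [random.gauss(0, 1) for _ in range(8)]
    keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
    values = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
    out, bound, path = certified_attend(q, keys, values, tol=0.5)
    exact = attend(q, keys, values)
    err = max(abs(a - b) for a, b in zip(out, exact))
    print(path, "bound:", bound, "actual error:", err)
```

The bound here is deliberately loose and worst-case. A real system would presumably tighten it per head and insert intermediate precision levels before falling back to the exact originals, as the abstract's "multi-stage fallback ladder" suggests.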

Dean Calver • 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	PG-19	--	244
Language Modeling	PG19 32K	Perplexity 9.956	8
Long-context Understanding	RULER	Delta Accuracy 0.015	4
Long-context Understanding	RULER	Delta Accuracy 6.9	4
Information Retrieval	NIAH	Delta Accuracy 2	3
Long-context retrieval	NIAH	Delta Accuracy (Δacc) 0.00e+0	3
Language Modeling	PG-19 8K context length	Perplexity 12.313	2
Language Modeling	PG-19 64K context length	Perplexity 9.044	2
Language Modeling	PG-19 128K context length	Perplexity 7.245	2
Needle Retrieval	NIAH 8K 10 needles per trial (test)	Accuracy 39	2

Showing 10 of 16 rows
