Runtime-Certified Bounded-Error Quantized Attention

About

KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA 3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.
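
To make the idea of a per-head, per-step bound with fallback concrete, here is a minimal single-head sketch in plain Python. It is not the paper's implementation: the function names, the per-tensor symmetric quantizer, and the specific bound are assumptions chosen for illustration. The bound uses a standard two-term split of the output error, ‖o′ − o‖∞ ≤ ‖p′ − p‖₁ · max|v| + max|v′ − v|, where the first term comes from key quantization (via a logit perturbation δ, giving ‖p′ − p‖₁ ≤ e^{2δ} − 1) and the second from value quantization.

```python
import math

def quantize(rows, bits):
    """Symmetric round-to-nearest quantization, one scale per tensor (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for row in rows for x in row)
    scale = amax / qmax if amax > 0 else 1.0
    # Return dequantized values; per-element error is at most scale / 2.
    deq = [[round(x / scale) * scale for x in row] for row in rows]
    return deq, scale

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(q, keys, values):
    """Dense single-query attention: softmax(q.K^T / sqrt(d)) V."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    p = softmax(scores)
    return [sum(pi * v[j] for pi, v in zip(p, values)) for j in range(len(values[0]))]

def error_bound(q, values_deq, k_scale, v_scale):
    """Upper bound on the max-norm output error, using only quantized-side data."""
    d = len(q)
    # Term (i): logit perturbation |q.(k' - k)| / sqrt(d) <= ||q||_1 * (k_scale / 2) / sqrt(d).
    delta = sum(abs(x) for x in q) * (k_scale / 2) / math.sqrt(d)
    dist_term = min(2.0, math.expm1(2 * delta))  # bound on ||p' - p||_1
    # Term (ii): per-element value reconstruction error.
    v_err = v_scale / 2
    v_max = max(abs(x) for row in values_deq for x in row) + v_err  # >= max |v|
    return dist_term * v_max + v_err

def certified_attention(q, keys, values, tol):
    """Use INT8 keys / INT4 values if the bound is within tol, else fall back to originals."""
    k_deq, k_scale = quantize(keys, 8)
    v_deq, v_scale = quantize(values, 4)
    bound = error_bound(q, v_deq, k_scale, v_scale)
    if bound <= tol:
        return attend(q, k_deq, v_deq), bound, "quantized"
    return attend(q, keys, values), 0.0, "fallback"
```

Calling `certified_attention(q, keys, values, tol)` returns the output, the certified error bound, and which path was taken. A real system would quantize once at cache-write time and keep the full-precision tensors in a separate memory tier; the sketch re-quantizes on every call only to stay self-contained, and it collapses the multi-stage fallback ladder into a single step.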

Dean Calver • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | PG-19 | -- | 206 |
| Language Modeling | PG19 32K | Perplexity 9.956 | 8 |
| Long-context Understanding | RULER | Delta Accuracy 0.015 | 4 |
| Long-context Understanding | RULER | Delta Accuracy 6.9 | 4 |
| Information Retrieval | NIAH | Delta Accuracy 2 | 3 |
| Long-context retrieval | NIAH | Delta Accuracy (Δacc) 0.00e+0 | 3 |
| Language Modeling | PG-19 8K context length | Perplexity 12.313 | 2 |
| Language Modeling | PG-19 64K context length | Perplexity 9.044 | 2 |
| Language Modeling | PG-19 128K context length | Perplexity 7.245 | 2 |
| Needle Retrieval | NIAH 8K 10 needles per trial (test) | Accuracy 39 | 2 |
Showing 10 of 16 rows
