Runtime-Certified Bounded-Error Quantized Attention
About
KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA 3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.
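The bound-then-fallback idea in the abstract can be illustrated with a small sketch for one head and one query. This is not the paper's implementation. The symmetric per-row quantizer, the specific bound (softmax L1 distortion at most e^{2δ} − 1 for a logit perturbation of at most δ, plus a worst-case value rounding term), the single-stage fallback, and every function name below are assumptions chosen for illustration.

```python
import math

def quantize_rows(rows, bits):
    """Symmetric per-row quantization (assumed scheme): returns (codes, scales)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    codes, scales = [], []
    for row in rows:
        amax = max(abs(x) for x in row)
        scale = amax / qmax if amax > 0 else 1.0
        codes.append([max(-qmax, min(qmax, round(x / scale))) for x in row])
        scales.append(scale)
    return codes, scales

def dequantize_rows(codes, scales):
    return [[c * s for c in row] for row, s in zip(codes, scales)]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def attend(q, keys, values):
    """Plain scaled dot-product attention for a single query vector."""
    d = len(q)
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    p = softmax(logits)
    return [sum(pi * v[j] for pi, v in zip(p, values)) for j in range(len(values[0]))]

def error_bound(q, k_scales, v_deq, v_scales):
    """Two-term L2 bound on the output error, computed without the originals.

    Term (i): key rounding error is at most scale/2 per element, so each logit
    moves by at most delta; the softmax L1 distance is then <= e^(2*delta) - 1
    (and never more than 2).
    Term (ii): each value row moves by at most sqrt(dv) * scale/2 in L2 norm.
    """
    d, dv = len(q), len(v_deq[0])
    q_l1 = sum(abs(x) for x in q)
    delta = max(q_l1 * s / 2 for s in k_scales) / math.sqrt(d)
    dist_l1 = min(2.0, math.expm1(min(2 * delta, 2.0)))
    v_err = [math.sqrt(dv) * s / 2 for s in v_scales]
    v_norm_max = max(math.sqrt(sum(x * x for x in v)) + e
                     for v, e in zip(v_deq, v_err))
    return dist_l1 * v_norm_max + max(v_err)

def certified_attention(q, keys_fp16, values_fp16, tol):
    """Use the quantized path if its bound is within tol, else recompute exactly.

    Quantization is done inline for brevity; a real cache would quantize once
    at write time and keep the originals in a slower memory tier.
    """
    k_codes, k_scales = quantize_rows(keys_fp16, 8)
    v_codes, v_scales = quantize_rows(values_fp16, 4)
    v_deq = dequantize_rows(v_codes, v_scales)
    bound = error_bound(q, k_scales, v_deq, v_scales)
    if bound <= tol:
        k_deq = dequantize_rows(k_codes, k_scales)
        return attend(q, k_deq, v_deq), bound, "quantized"
    return attend(q, keys_fp16, values_fp16), 0.0, "fallback"
```

Calling `certified_attention(q, keys, values, tol)` returns the output, the certified bound, and which path was taken. The bound here is deliberately loose and easy to verify; a tighter per-head bound or intermediate precision tiers would reduce how often the exact path is needed.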
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Modeling | PG-19 | -- | -- | 206 |
| Language Modeling | PG19 32K | Perplexity | 9.956 | 8 |
| Long-context Understanding | RULER | Delta Accuracy | 0.015 | 4 |
| Long-context Understanding | RULER | Delta Accuracy | 6.9 | 4 |
| Information Retrieval | NIAH | Delta Accuracy | 2 | 3 |
| Long-context retrieval | NIAH | Delta Accuracy (Δacc) | 0.00e+0 | 3 |
| Language Modeling | PG-19 8K context length | Perplexity | 12.313 | 2 |
| Language Modeling | PG-19 64K context length | Perplexity | 9.044 | 2 |
| Language Modeling | PG-19 128K context length | Perplexity | 7.245 | 2 |
| Needle Retrieval | NIAH 8K 10 needles per trial (test) | Accuracy | 39 | 2 |