Draft-based Approximate Inference for LLMs
About
Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
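The general idea behind lookahead-based KV cache dropping can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single attention head and plain scaled dot-product attention, and it stands in hand-written vectors for the queries of draft-generated lookahead tokens. The function names (`importance_from_lookahead`, `select_kv_to_keep`) and the max-over-queries aggregation are hypothetical choices made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def importance_from_lookahead(lookahead_queries, prompt_keys):
    """Score each prompt position by the largest attention weight it
    receives from any lookahead query (assumed aggregation rule)."""
    dim = len(prompt_keys[0])
    scores = [0.0] * len(prompt_keys)
    for q in lookahead_queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in prompt_keys
        ]
        attn = softmax(logits)
        scores = [max(s, a) for s, a in zip(scores, attn)]
    return scores


def select_kv_to_keep(scores, budget):
    """Return the indices of the `budget` highest-scoring positions,
    in their original prompt order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data: six prompt keys and two stand-in lookahead queries.
prompt_keys = [[1, 0], [0, 1], [-1, 0], [0, -1], [4, 0], [0, 4]]
lookahead_queries = [[3, 0], [0, 3]]

scores = importance_from_lookahead(lookahead_queries, prompt_keys)
keep = select_kv_to_keep(scores, budget=2)
print(keep)  # [4, 5]
```

In this toy setup, positions 4 and 5 are kept because each aligns strongly with one of the lookahead queries. A real system would take the queries from draft-model output, score per layer and head, and apply the selection to the target model's cache.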
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 8.45 | 447 |
| Long-context Understanding | LongBench (test) | Avg Score | 49.17 | 136 |
| Long-context Understanding | RULER 64k | Accuracy | 64.02 | 25 |
| Long-context Understanding | RULER 128k | Accuracy | 52.62 | 15 |
| Inference Efficiency | LLaMA 8B 8K context length 3.1 | Theoretical Compute (TFLOPs) | 159 | 10 |
| Inference Efficiency | LLaMA 8B 32K context length 3.1 | Theoretical Compute (TFLOPs) | 1.12e+3 | 5 |
| Efficiency Analysis | Context Length 4K | Theoretical Compute (TFLOPs) | 70 | 5 |
| Efficiency Analysis | Context Length 16K | Theoretical Compute (TFLOPs) | 398 | 5 |
| Efficiency Analysis | Context Length 32K | Theoretical Compute (TFLOPs) | 1.12e+3 | 5 |