
Draft-based Approximate Inference for LLMs

About

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
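The core idea of draft-based importance estimation can be illustrated with a minimal sketch: a small draft model generates a few lookahead tokens, and the attention those lookahead queries place on the prompt's KV pairs is used to score and prune the cache (SpecKV-style). The shapes, the single-head softmax attention, and the max-aggregation rule below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: prompt tokens, draft lookahead tokens, head dimension.
n_prompt, n_lookahead, d = 32, 4, 16

# Keys from the prompt's KV cache and queries for draft-generated lookahead tokens.
keys = rng.standard_normal((n_prompt, d))
lookahead_queries = rng.standard_normal((n_lookahead, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Attention of each lookahead query over the prompt KV pairs.
attn = softmax(lookahead_queries @ keys.T / np.sqrt(d))  # (n_lookahead, n_prompt)

# Score each KV pair by the maximum attention mass it receives
# from any lookahead token (an assumed aggregation rule).
importance = attn.max(axis=0)  # (n_prompt,)

# Keep only the top-k most important KV pairs; drop the rest from the cache.
k = 8
keep = np.sort(np.argsort(importance)[-k:])
compressed_keys = keys[keep]
print(compressed_keys.shape)  # (8, 16)
```

The same scoring idea transfers to prompt compression (SpecPC-style): instead of dropping KV pairs after prefill, the draft model's attention identifies low-importance prompt tokens to discard before the target model ever processes them.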

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 8.45 | 447 |
| Long-context Understanding | LongBench (test) | Avg Score | 49.17 | 136 |
| Long-context Understanding | RULER 64k | Accuracy | 64.02 | 25 |
| Long-context Understanding | RULER 128k | Accuracy | 52.62 | 15 |
| Inference Efficiency | LLaMA 3.1 8B, 8K context length | Theoretical Compute (TFLOPs) | 159 | 10 |
| Inference Efficiency | LLaMA 3.1 8B, 32K context length | Theoretical Compute (TFLOPs) | 1.12e+3 | 5 |
| Efficiency Analysis | Context Length 4K | Theoretical Compute (TFLOPs) | 70 | 5 |
| Efficiency Analysis | Context Length 16K | Theoretical Compute (TFLOPs) | 398 | 5 |
| Efficiency Analysis | Context Length 32K | Theoretical Compute (TFLOPs) | 1.12e+3 | 5 |
