
Draft-based Approximate Inference for LLMs

About

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
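The core idea of draft-based importance estimation can be illustrated with a minimal sketch: a small draft model generates a few lookahead tokens, and the attention those lookahead queries place on the prompt's KV pairs is used to score and prune the cache (SpecKV-style). The shapes, the single-head softmax attention, and the max-aggregation rule below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: prompt tokens, draft lookahead tokens, head dimension.
n_prompt, n_lookahead, d = 32, 4, 16

# Keys from the prompt's KV cache and queries for draft-generated lookahead tokens.
keys = rng.standard_normal((n_prompt, d))
lookahead_queries = rng.standard_normal((n_lookahead, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Attention of each lookahead query over the prompt KV pairs.
attn = softmax(lookahead_queries @ keys.T / np.sqrt(d))  # (n_lookahead, n_prompt)

# Score each KV pair by the maximum attention mass it receives
# from any lookahead token (an assumed aggregation rule).
importance = attn.max(axis=0)  # (n_prompt,)

# Keep only the top-k most important KV pairs; drop the rest from the cache.
k = 8
keep = np.sort(np.argsort(importance)[-k:])
compressed_keys = keys[keep]
print(compressed_keys.shape)  # (8, 16)
```

The same scoring idea transfers to prompt compression (SpecPC-style): instead of dropping KV pairs after prefill, the draft model's attention identifies low-importance prompt tokens to discard before the target model ever processes them.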

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 8.45 | 447 |
| Long-context Understanding | LongBench (test) | Avg Score | 49.17 | 136 |
| Long-context Understanding | RULER 64k | Accuracy | 64.02 | 25 |
| Long-context Understanding | RULER 128k | Accuracy | 52.62 | 15 |
| Inference Efficiency | LLaMA 3.1 8B, 8K context length | Theoretical Compute (TFLOPs) | 159 | 10 |
| Inference Efficiency | LLaMA 3.1 8B, 32K context length | Theoretical Compute (TFLOPs) | 1.12e+3 | 5 |
| Efficiency Analysis | Context Length 4K | Theoretical Compute (TFLOPs) | 70 | 5 |
| Efficiency Analysis | Context Length 16K | Theoretical Compute (TFLOPs) | 398 | 5 |
| Efficiency Analysis | Context Length 32K | Theoretical Compute (TFLOPs) | 1.12e+3 | 5 |
