
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

About

Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to verify them in parallel. While this technique has achieved notable speedups, most existing approaches require either additional parameters or extensive training to construct effective draft models, thereby restricting their applicability across different LLMs and tasks. To address this limitation, we explore a novel plug-and-play SD solution with layer-skipping, which skips intermediate layers of the target LLM to form the compact draft model. Our analysis reveals that LLMs exhibit great potential for self-acceleration through layer sparsity, and that this sparsity is task-specific. Building on these insights, we introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. SWIFT does not require auxiliary models or additional training, making it a plug-and-play solution for accelerating LLM inference across diverse input data streams. Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT achieves a 1.3x-1.6x speedup while preserving the original distribution of the generated text. We release our code at https://github.com/hemingkx/SWIFT.
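The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration, not the SWIFT implementation: `full_model` and `draft_model` are hypothetical toy stand-ins (the draft plays the role of the layer-skipped target model, so it is cheap but sometimes wrong), and verification is shown sequentially even though in practice the target checks all drafted tokens in one parallel forward pass.

```python
import random

def full_model(prefix):
    """Toy stand-in for the target LLM under greedy decoding:
    the next token is a deterministic function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def draft_model(prefix):
    """Toy stand-in for the layer-skipped draft model: it agrees
    with the target most of the time but is occasionally wrong."""
    tok = full_model(prefix)
    return tok if random.random() < 0.8 else (tok + 1) % 100

def speculative_decode(prompt, n_tokens, gamma=4):
    """Draft `gamma` tokens cheaply, then verify them against the
    target model; accepted tokens are kept, and on the first
    mismatch the target's own token is taken instead."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft gamma tokens autoregressively with the cheap model.
        draft, ctx = [], list(out)
        for _ in range(gamma):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) Verify: compare each drafted token with what the target
        # would generate (done in parallel in real systems).
        for tok in draft:
            target_tok = full_model(out)
            if tok == target_tok:
                out.append(tok)          # draft token accepted
            else:
                out.append(target_tok)   # rejected: take target's token
                break
            if len(out) - len(prompt) >= n_tokens:
                break
    return out[len(prompt):]
```

Note the key property this preserves: because every emitted token is checked against the target, the output is identical to plain greedy decoding with the full model, regardless of how unreliable the draft is; the draft quality only affects speed.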

Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context Generation (Reasoning) | MMLU-Pro | TPT (s) | 13.67 | 20 |
| Long-context Input (Summarization) | PG19 | TPT (s) | 6.12 | 20 |
| Long-context Input (Summarization) | BookSum | TPT (s) | 3.76 | 20 |
| Long-context Input (Summarization) | GovReport | Time Per Token (TPT, s) | 8.62 | 20 |
| Long-context Generation (Reasoning) | AIME24 | TPT (s) | 27.54 | 20 |
| Long-context Generation (Reasoning) | AIME25 | TPT (s) | 30.23 | 20 |
| Mathematical Reasoning | GSM8K (test) | Speedup (x) | 1.39 | 15 |
| Narrative Generation | TinyStories (test) | Speedup (x) | 1.62 | 15 |
| Speculative Decoding Efficiency | CNN/DM, GSM8K, TinyStories (aggregate) | Decoding Speed (tokens/s) | 28.26 | 15 |
| Summarization | CNN/Daily Mail (test) | Speedup (x) | 1.43 | 15 |
