Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

About

Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a framework that unlocks the potential of pipeline parallelism for single-sequence decoding. By partitioning the target LLM into $n$ pipeline stages, SPD allows the LLM to process $n$ tokens of a single sequence in parallel to accelerate decoding. To keep the pipeline continuously filled during single-sequence decoding, a speculation module aggregates intermediate features from different pipeline depths to predict the next token. The module executes strictly in parallel with the target model's pipeline step, which yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves significantly higher theoretical and wall-clock speedups than mainstream baselines at moderate pipeline depths, though more aggressive settings require further improvement. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
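
As a rough illustration of the pipeline-filling idea described above, the following Python sketch counts stage-steps for a toy $n$-stage pipeline. It is not taken from the SPD codebase: the function names are hypothetical, every stage is assumed to take one unit of time, and every speculated token is assumed to be accepted, so the numbers are an idealized upper bound rather than a measured speedup.

```python
from typing import List, Optional


def sequential_steps(num_tokens: int, num_stages: int) -> int:
    """Stage-steps when each token must clear all stages before the next starts."""
    return num_tokens * num_stages


def pipelined_steps(num_tokens: int, num_stages: int) -> int:
    """Stage-steps when a new speculated token enters stage 0 at every step.

    Assumption: every speculated token is accepted, so the pipeline is never flushed.
    """
    if num_tokens == 0:
        return 0
    # The first token needs num_stages steps; each later token finishes one step after.
    return num_tokens + num_stages - 1


def occupancy(num_tokens: int, num_stages: int) -> List[List[Optional[int]]]:
    """For each step, list the token index held by each stage (None means idle)."""
    schedule = []
    for step in range(pipelined_steps(num_tokens, num_stages)):
        row = []
        for stage in range(num_stages):
            token = step - stage  # token i reaches stage s at step i + s
            row.append(token if 0 <= token < num_tokens else None)
        schedule.append(row)
    return schedule


def ideal_speedup(num_tokens: int, num_stages: int) -> float:
    """Upper-bound speedup of the filled pipeline over stage-by-stage decoding."""
    return sequential_steps(num_tokens, num_stages) / pipelined_steps(num_tokens, num_stages)


if __name__ == "__main__":
    for row in occupancy(num_tokens=5, num_stages=3):
        print(row)
    print(ideal_speedup(num_tokens=1000, num_stages=4))
```

Under these assumptions the ratio approaches $n$ as the sequence grows. Any rejected speculation would force part of the pipeline to be refilled, so real speedups depend on the acceptance rate, which this toy model does not capture.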

Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei • 2026

Related benchmarks

Task	Dataset	Result	Rank
General speculative decoding performance	Mean (MT-Bench, HumanEval, GSM8K)	Average Acceptance Length (τ): 3.83		112
