Batch Speculative Decoding Done Right

About

Speculative decoding must produce outputs distribution identical to standard autoregressive generation-this output equivalence is not an optimization target but the defining criterion of valid speculative decoding. We demonstrate that all existing batch speculative decoding implementations violate this fundamental requirement, producing corrupted outputs ranging from repetitive tokens to gibberish. These failures stem from the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, desynchronizing position IDs, attention masks, and KV-cache state. We present the first authentic batch speculative decoding framework. We (1) formalize the synchronization invariants that valid batch speculative decoding must satisfy, (2) present EQSPEC, the first algorithm that guarantees output equivalence, and analyze its cost structure to show that alignment overhead grows superlinearly and consumes up to 40\% of computation, and (3) introduce EXSPEC, which reduces this overhead through cross-batch scheduling that dynamically groups same-length sequences. On SpecBench across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B pairs, our methods achieve up to 3x throughput improvement at batch size 8 while maintaining algorithmic correctness. Our methods achieve 95\% decoding-equivalence, with residual divergence attributable to floating-point non-determinism in GPU inference, not the synchronization failures that cause near-zero equivalence of prior methods. Our code is available at https://github.com/eBay/spec_dec.

Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang• 2025

Related benchmarks

Task	Dataset	Result
Output Equivalence	Vicuna	Exact Match97.3	13
Output Equivalence	Qwen3	Exact Match0.95	13
Output Equivalence	GLM4	Exact Match96.7	11

Showing 3 of 3 rows

Other info

Follow for update

@wizwand_team Discord