Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

Batch Speculative Decoding Done Right

About

Speculative decoding must produce outputs distribution identical to standard autoregressive generation-this output equivalence is not an optimization target but the defining criterion of valid speculative decoding. We demonstrate that all existing batch speculative decoding implementations violate this fundamental requirement, producing corrupted outputs ranging from repetitive tokens to gibberish. These failures stem from the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, desynchronizing position IDs, attention masks, and KV-cache state. We present the first authentic batch speculative decoding framework. We (1) formalize the synchronization invariants that valid batch speculative decoding must satisfy, (2) present EQSPEC, the first algorithm that guarantees output equivalence, and analyze its cost structure to show that alignment overhead grows superlinearly and consumes up to 40\% of computation, and (3) introduce EXSPEC, which reduces this overhead through cross-batch scheduling that dynamically groups same-length sequences. On SpecBench across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B pairs, our methods achieve up to 3x throughput improvement at batch size 8 while maintaining algorithmic correctness. Our methods achieve 95\% decoding-equivalence, with residual divergence attributable to floating-point non-determinism in GPU inference, not the synchronization failures that cause near-zero equivalence of prior methods. Our code is available at https://github.com/eBay/spec_dec.

Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang• 2025

Related benchmarks

TaskDatasetResultRank
Output EquivalenceVicuna
Exact Match97.3
13
Output EquivalenceQwen3
Exact Match0.95
13
Output EquivalenceGLM4
Exact Match96.7
11
Showing 3 of 3 rows

Other info

Follow for update