Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

About

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, which limits the pool of possible drafters and often necessitates training a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms achieve speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.

Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel • 2025
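
For context on what "lossless" means in the abstract above, the following is a minimal sketch of the accept/reject rule commonly used in standard speculative sampling, where drafter and target share one vocabulary. It is not one of the three heterogeneous-vocabulary methods this paper presents. The function names (`verify_draft_token`, `empirical_distribution`) and the toy distributions are hypothetical and chosen only for illustration.

```python
import random


def verify_draft_token(p, q, x, rng):
    """Accept or replace one drafted token (shared-vocabulary setting).

    p, q: target and drafter probabilities over the same vocabulary.
    x:    index of the token drafted from q (so q[x] > 0).
    rng:  a random.Random instance.
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    # A rejection implies p[x] < q[x], so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return token, False


def empirical_distribution(p, q, n, seed=0):
    """Draft from q, verify against p, and return observed token frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        x = rng.choices(range(len(q)), weights=q, k=1)[0]
        token, _ = verify_draft_token(p, q, x, rng)
        counts[token] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    target = [0.6, 0.3, 0.1]   # toy target distribution (assumption)
    drafter = [0.2, 0.5, 0.3]  # toy drafter distribution (assumption)
    print(empirical_distribution(target, drafter, 200_000))
```

Under these assumptions, the observed frequencies should approach the target distribution even though every candidate token was proposed by the drafter, which is the sense in which the output distribution is preserved.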

Related benchmarks

Task	Dataset	Result
Text Generation	GSM8K	Accuracy 55.02	63
Multiple-Choice Classification	MMLU	Accuracy 65.43	47
Open-ended generation	TriviaQA	--	37
Multiple-choice Question Answering	ARC-C	Accuracy 53.72	22
Free-form text generation	CoQA	Accuracy 64.67	22
Factual Hallucination	FACTOR News	Accuracy 67.55	12
Factual Hallucination	TruthfulQA	MC1 Score 39.59	12
Factual Hallucination	FACTOR Wiki	Accuracy 56.81	12
