SAM Decoding: Speculative Decoding via Suffix Automaton

About

Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedup, but they often rely on incomplete retrieval resources, inefficient retrieval methods, and are constrained to certain domains. This paper presents a novel retrieval-based speculative decoding method that adapts suffix automaton (SAM) for efficient and accurate draft generation by utilizing common text corpus and dynamic text sequence. Unlike existing $n$-gram matching methods, SAM-Decoding finds the exact longest suffix match, achieving an average time complexity of O(1) per generation step of SAM update and suffix retrieval. It can also integrate with existing methods, adaptively selecting a draft generation strategy based on match length to generalize to broader domains. Extensive experiments on Spec-Bench show that our method is $18\%+$ faster than other retrieval-based SD methods. Additionally, when combined with advanced EAGLE-2, it provides an additional speedup of $3.28\%$ -- $11.13\%$ across various-sized LLM backbones. Our code is available at our \href{https://github.com/hyx1999/SAM-Decoding}{repository}.

Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, Jing Zhang• 2024

Related benchmarks

Task	Dataset	Result
Instruction Following	Alpaca	Speedup (x)1.87	173
Inference Efficiency	HumanEval	Speedup Factor3.35	90
Speculative Decoding	LiveCodeBench	Speedup Factor2.29	66
Code Generation	HumanEval	Tokens/s91.5	61
Speculative Decoding	Spec-Bench	MT Score4.62	57
Inference Acceleration	Spec-Bench	Speedup2.58	53
Speculative Decoding	AIME 2024	Throughput (T)61.11	48
Speculative Decoding	GPQA Diamond	Throughput70.43	48
Summarization	CNN/DM	MAT Score2.71	30
Multi-turn dialogue	MT-Bench	MAT Score2.15	30

Showing 10 of 25 rows

Other info

Follow for update

@wizwand_team Discord