Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

About

Massive parameters of LLMs have made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based training-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. It stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires \textless2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30\% and even a widely recognized training method by 25\%.

Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu• 2024

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Speed Up (x)1.98	246
Instruction Following	Alpaca	Speedup (x)1.95	173
Code Generation	HumanEval	Speedup Factor2.79	147
Inference Efficiency	HumanEval	Speedup Factor2.15	90
Multi-turn dialogue	MT-Bench	Speedup2.44	80
Code Generation	MBPP	Speedup2.6	79
Speculative Decoding	LiveCodeBench	Speedup Factor2.21	66
Code Generation	HumanEval	Tokens/s71.49	61
LLM Inference Acceleration	GSM8K	Speedup2.1	61
Speculative Decoding	Spec-Bench	MT Score2.74	57

Showing 10 of 41 rows

Other info

Follow for update

@wizwand_team Discord