Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
About
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose **S**mart **P**arallel **A**uto-**C**orrect d**E**coding (SPACE), an approach designed to achieve lossless acceleration of LLMs. By integrating semi-autoregressive inference with speculative decoding, SPACE enables autoregressive LLMs to parallelize token generation and verification. This is realized through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to predict multiple tokens simultaneously. In addition, an auto-correct decoding algorithm performs generation and verification of token sequences within a single model invocation. In extensive experiments across a range of LLMs, SPACE demonstrates inference speedups of 2.7x-4.0x on HumanEval-X while maintaining output quality.
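The auto-correct decoding loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and variable names (`model_call`, `space_decode`, `TARGET`, `K`) are hypothetical, and a fixed target sequence stands in for a greedy semi-autoregressive LLM. The key idea it shows is that each invocation simultaneously verifies the previous draft, emits one corrected token, and proposes a fresh draft, so several tokens can be accepted per call.

```python
TARGET = [1, 2, 3, 4, 5, 6, 7, 8]  # toy stand-in for the model's greedy output
K = 3  # number of speculative draft tokens produced per call

def model_call(prefix, draft):
    """One simulated model invocation (hypothetical API, not SPACE's code).

    Verifies `draft` against what the model would emit after `prefix`,
    returns one guaranteed-correct "bonus" token, and proposes a fresh
    K-token draft for the next step. A real semi-autoregressive LLM does
    all of this in a single forward pass using mask tokens.
    """
    accepted, pos = [], len(prefix)
    for d in draft:
        if pos < len(TARGET) and d == TARGET[pos]:
            accepted.append(d)  # draft token matches the model's prediction
            pos += 1
        else:
            break  # first mismatch invalidates the rest of the draft
    # The verifying pass always yields the correct next token as well.
    bonus = [TARGET[pos]] if pos < len(TARGET) else []
    pos += len(bonus)
    new_draft = TARGET[pos:pos + K]  # speculative tokens for the next call
    return accepted, bonus, new_draft

def space_decode():
    """Decode the full sequence, counting model invocations."""
    tokens, draft, calls = [], [], 0
    while len(tokens) < len(TARGET):
        calls += 1
        accepted, bonus, draft = model_call(tokens, draft)
        tokens.extend(accepted + bonus)
        if not bonus:  # end of sequence
            break
    return tokens, calls

print(space_decode())  # 8 tokens decoded in 3 calls instead of 8
```

In this idealized trace every draft verifies, so up to K+1 tokens are accepted per invocation; with a real model the acceptance rate depends on how well the fine-tuned draft heads match the autoregressive distribution, which is where the reported 2.7x-4.0x speedup comes from.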
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Summarization | XSum | -- | 108 |
| Generative Inference | MT-Bench | Speedup 2.26 | 44 |
| Multi-turn Dialogue | Spec-Bench Multi. | CR 2.31 | 21 |
| Summarization | Spec-Bench Sum. | CR 2.19 | 21 |
| Translation | Spec-Bench Trans. | CR 1.73 | 21 |
| Mathematical Reasoning | Spec-Bench Math | CR 2.15 | 21 |
| Retrieval-Augmented Generation | Spec-Bench RAG | CR 1.88 | 21 |
| Question Answering | Spec-Bench QA | CR 1.72 | 21 |
| Text Generation | Spec-Bench Overall | SD Score 1.46 | 21 |
| Code Generation | HumanEval-X | -- | 20 |