Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
About
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose **S**mart **P**arallel **A**uto-**C**orrect d**E**coding (SPACE), an approach designed to achieve lossless acceleration of LLMs. By integrating semi-autoregressive inference with speculative decoding, SPACE enables autoregressive LLMs to parallelize token generation and verification. This is realized through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to predict multiple tokens simultaneously. In addition, an auto-correct decoding algorithm performs generation and verification of token sequences within a single model invocation. In extensive experiments across a range of LLMs, SPACE demonstrates inference speedups of 2.7x-4.0x on HumanEval-X while maintaining output quality.
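The auto-correct decoding loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and variable names (`model_call`, `space_decode`, `TARGET`, `K`) are hypothetical, and a fixed target sequence stands in for a greedy semi-autoregressive LLM. The key idea it shows is that each invocation simultaneously verifies the previous draft, emits one corrected token, and proposes a fresh draft, so several tokens can be accepted per call.

```python
TARGET = [1, 2, 3, 4, 5, 6, 7, 8]  # toy stand-in for the model's greedy output
K = 3  # number of speculative draft tokens produced per call

def model_call(prefix, draft):
    """One simulated model invocation (hypothetical API, not SPACE's code).

    Verifies `draft` against what the model would emit after `prefix`,
    returns one guaranteed-correct "bonus" token, and proposes a fresh
    K-token draft for the next step. A real semi-autoregressive LLM does
    all of this in a single forward pass using mask tokens.
    """
    accepted, pos = [], len(prefix)
    for d in draft:
        if pos < len(TARGET) and d == TARGET[pos]:
            accepted.append(d)  # draft token matches the model's prediction
            pos += 1
        else:
            break  # first mismatch invalidates the rest of the draft
    # The verifying pass always yields the correct next token as well.
    bonus = [TARGET[pos]] if pos < len(TARGET) else []
    pos += len(bonus)
    new_draft = TARGET[pos:pos + K]  # speculative tokens for the next call
    return accepted, bonus, new_draft

def space_decode():
    """Decode the full sequence, counting model invocations."""
    tokens, draft, calls = [], [], 0
    while len(tokens) < len(TARGET):
        calls += 1
        accepted, bonus, draft = model_call(tokens, draft)
        tokens.extend(accepted + bonus)
        if not bonus:  # end of sequence
            break
    return tokens, calls

print(space_decode())  # 8 tokens decoded in 3 calls instead of 8
```

In this idealized trace every draft verifies, so up to K+1 tokens are accepted per invocation; with a real model the acceptance rate depends on how well the fine-tuned draft heads match the autoregressive distribution, which is where the reported 2.7x-4.0x speedup comes from.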
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Summarization | XSum | -- | 108 |
| Generative Inference | MT-Bench | Speedup 2.26 | 44 |
| Multi-turn Dialogue | Spec-Bench Multi. | CR 2.31 | 21 |
| Summarization | Spec-Bench Sum. | CR 2.19 | 21 |
| Translation | Spec-Bench Trans. | CR 1.73 | 21 |
| Mathematical Reasoning | Spec-Bench Math | CR 2.15 | 21 |
| Retrieval-Augmented Generation | Spec-Bench RAG | CR 1.88 | 21 |
| Question Answering | Spec-Bench QA | CR 1.72 | 21 |
| Text Generation | Spec-Bench Overall | SD Score 1.46 | 21 |
| Code Generation | HumanEval-X | -- | 20 |