Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

About

Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs. Our approach begins by tapping into the potential of LLMs to accurately perceive and predict the response length with minimal overhead. By leveraging this information, we introduce an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches. We evaluate our approach on real-world instruction datasets using a LLaMA-based model, and our results demonstrate an 86% improvement in inference throughput without compromising effectiveness. Notably, our method is orthogonal to other inference acceleration techniques, making it a valuable addition to many existing toolkits (e.g., FlashAttention, quantization) for LLM inference.
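The scheduling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the predicted response lengths are already available as a list of integers, and the names `schedule_micro_batches` and `wasted_tokens` are hypothetical helpers introduced here. It sorts queries by predicted length, cuts the sorted order into fixed-size micro-batches, and counts the decoding steps wasted when each batch runs until its longest member finishes.

```python
from typing import List, Sequence


def schedule_micro_batches(
    predicted_lengths: Sequence[int], batch_size: int
) -> List[List[int]]:
    """Group query indices into micro-batches of similar predicted length.

    Hypothetical helper: assumes one predicted length per query.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Sort query indices by predicted response length (sorted() is stable).
    order = sorted(range(len(predicted_lengths)),
                   key=lambda i: predicted_lengths[i])
    # Cut the sorted order into consecutive chunks of at most batch_size.
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def wasted_tokens(
    predicted_lengths: Sequence[int], batches: Sequence[Sequence[int]]
) -> int:
    """Count tokens generated past each query's own length, assuming every
    batch keeps decoding until its longest member is done."""
    waste = 0
    for batch in batches:
        longest = max(predicted_lengths[i] for i in batch)
        waste += sum(longest - predicted_lengths[i] for i in batch)
    return waste


if __name__ == "__main__":
    lengths = [12, 480, 15, 510, 20, 495]  # made-up predicted lengths
    naive = [[0, 1, 2], [3, 4, 5]]         # batches in arrival order
    grouped = schedule_micro_batches(lengths, batch_size=3)
    print("grouped batches:", grouped)
    print("wasted tokens (arrival order):", wasted_tokens(lengths, naive))
    print("wasted tokens (length-sorted):", wasted_tokens(lengths, grouped))
```

With these made-up lengths, grouping short and long queries separately removes most of the padding work, which is the effect the abstract attributes to sequence scheduling. A real pipeline would also need to handle mispredicted lengths, which this sketch ignores.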

Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, Yang You • 2023

Related benchmarks

Task	Dataset	Result
Output Length Prediction	ForeLen LongSeq	MAE 145.6	48
Output Length Prediction	ForeLen RL	MAE 197.9	32
Output Length Prediction	ForeLen Reasoning	MAE 254.6	32
Length Prediction	ForeLen RL 1.0 (test)	MAE 133.8	16
Output Length Prediction	LMSYS	MAE 91.32	16
Length Prediction	ForeLen Reasoning 1.0 (test)	MAE 296	16
Length Prediction	ForeLen Avg. 1.0 (test)	MAE 305.7	16
Output Sequence Length Prediction	WritingPrompts super-long sequences (> 17k tokens) OOD	MAE 214.5	8
System Performance Evaluation	Long Sequence	Throughput 119.2	8
System Performance Evaluation	Reasoning	Throughput 141.8	8

Showing 10 of 12 rows
