
S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput

About

Generating text with a large language model (LLM) consumes massive amounts of memory. Beyond the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow even larger than the model itself. The problem is exacerbated in current LLM serving frameworks that, not knowing the output sequence length in advance, reserve enough memory for the maximum sequence length to guarantee that a complete sequence can be generated. This forces a smaller batch size, which lowers GPU utilization and, above all, throughput. We argue that a system designed with a priori knowledge of the output sequence length can mitigate this problem. To this end, we propose S$^{3}$, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. S$^{3}$ achieves 6.49$\times$ higher throughput than systems that assume the worst case for the output sequence length.
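The memory arithmetic behind this argument can be sketched in a few lines. The sketch below is illustrative only: the per-token KV-cache size, the cache memory budget, the maximum sequence length, and the predicted length are all assumed numbers, not figures from the paper.

```python
# Illustrative assumptions (not from the paper):
KV_BYTES_PER_TOKEN = 800_000          # ~0.8 MB of KV cache per token
CACHE_BUDGET_BYTES = 40_000_000_000   # ~40 GB left for the cache after weights
MAX_SEQ_LEN = 2048                    # worst-case output length the server allows

def batch_size(reserved_tokens_per_seq: int) -> int:
    """How many sequences fit concurrently if each one reserves
    KV-cache memory for this many tokens up front."""
    return CACHE_BUDGET_BYTES // (reserved_tokens_per_seq * KV_BYTES_PER_TOKEN)

# Worst-case reservation: every sequence reserves the maximum length,
# even if it will actually generate far fewer tokens.
worst_case = batch_size(MAX_SEQ_LEN)          # 24 sequences

# Prediction-based reservation in the spirit of S^3: reserve only
# the (assumed) predicted output length for each sequence.
predicted_len = 256
prediction_based = batch_size(predicted_len)  # 195 sequences

print(worst_case, prediction_based)
```

With these assumed numbers, reserving for the predicted length instead of the maximum lets roughly 8x more sequences share the GPU, which is where the throughput gain comes from; a real system must also handle the sequences whose predicted length turns out to be too short.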

Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Output Length Prediction | ForeLen LongSeq | MAE | 161.8 | 48 |
| Output Length Prediction | ForeLen RL | MAE | 163.9 | 32 |
| Output Length Prediction | ForeLen Reasoning | MAE | 152.3 | 32 |
| Output Length Prediction | LMSYS | MAE | 83.51 | 16 |
| Length Prediction | ForeLen Reasoning 1.0 (test) | MAE | 169.7 | 16 |
| Length Prediction | ForeLen Avg. 1.0 (test) | MAE | 183.5 | 16 |
| Length Prediction | ForeLen RL 1.0 (test) | MAE | 168 | 16 |
| Output Sequence Length Prediction | WritingPrompts super-long sequences (> 17k tokens) OOD | MAE | 211 | 8 |
| System Performance Evaluation | Long Sequence | Throughput | 115.1 | 8 |
| System Performance Evaluation | Reasoning | Throughput | 139.9 | 8 |

Showing 10 of 13 rows.
