# Efficient LLM Scheduling by Learning to Rank

## About
In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as unknown a priori. Consequently, most LLM serving systems employ a simple first-come-first-served (FCFS) scheduling strategy, which leads to head-of-line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption: although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths within a batch of requests using learning to rank. This ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that approximates the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with a state-of-the-art LLM serving system and show significant performance improvements in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
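The core idea above can be sketched in a few lines: instead of serving requests in arrival order (FCFS), keep pending requests in a priority queue keyed by a predicted length-rank score, so shorter jobs tend to run first. This is only an illustrative sketch, not the paper's implementation; the rank predictor here (`predict_rank`) is a hypothetical stand-in for the learned ranking model, and in the toy usage below we simply use prompt length as a proxy score.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Lower score = predicted shorter output; only this field is compared.
    rank_score: float
    prompt: str = field(compare=False)


def schedule(prompts, predict_rank):
    """Yield prompts ordered by predicted output-length rank,
    approximating shortest-job-first instead of FCFS."""
    heap = [Request(predict_rank(p), p) for p in prompts]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap).prompt


# Toy usage: prompt length as a stand-in rank predictor.
order = list(schedule(["write an essay", "hi", "summarize this"],
                      predict_rank=len))
# The shortest prompt is scheduled first.
```

In a real serving system the queue would be re-scored as new requests arrive and batches form, but the ordering principle is the same.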
## Related benchmarks
| Task | Dataset | Result | Leaderboard Rank |
|---|---|---|---|
| Output Length Prediction | ForeLen LongSeq | MAE: 124.2 | 48 |
| Output Length Prediction | ForeLen Reasoning | MAE: 134.6 | 32 |
| Output Length Prediction | ForeLen RL | MAE: 131.8 | 32 |
| Length Prediction | ForeLen Reasoning 1.0 (test) | MAE: 139 | 16 |
| Output Length Prediction | LMSYS | MAE: 77.03 | 16 |
| Length Prediction | ForeLen Avg. 1.0 (test) | MAE: 157.5 | 16 |
| Length Prediction | ForeLen RL 1.0 (test) | MAE: 170.9 | 16 |
| System Performance Evaluation | Reasoning | Throughput: 150.6 | 8 |
| System Performance Evaluation | Long Sequence | Throughput: 127.1 | 8 |
| Output Sequence Length Prediction | WritingPrompts super-long sequences (> 17k tokens) OOD | MAE: 255.1 | 8 |