SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
About
In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: because cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach to enable cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, so a frozen decode module can be shared across models. This design admits a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN also naturally supports low-bit decoding: Quantized SUN (QSUN) achieves a 45% speedup with accuracy comparable to SUN while preserving the benefits of shared decoding.
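Because the frozen decode module is shared across models, decode requests from any model can be placed on any decode worker, so routing reduces to plain load balancing. The sketch below illustrates one such model-agnostic policy (least-loaded worker selection); the class and method names are illustrative assumptions, not SUN's actual implementation.

```python
import heapq

class SharedDecodeRouter:
    """Hypothetical sketch of a model-agnostic decode routing policy.

    Since SUN's shared decode module is identical on every worker, the
    router never looks at which model produced a request's prefill; it
    only balances the number of active decode sequences per worker.
    """

    def __init__(self, num_workers: int):
        # Min-heap of (active_decode_sequences, worker_id).
        self.heap = [(0, w) for w in range(num_workers)]
        heapq.heapify(self.heap)

    def route(self, request_id: str) -> int:
        # Assign the request to the least-loaded shared decode worker.
        load, worker = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + 1, worker))
        return worker

    def complete(self, worker: int) -> None:
        # Decrement a worker's load when one of its sequences finishes.
        self.heap = [(l - 1 if w == worker else l, w) for l, w in self.heap]
        heapq.heapify(self.heap)
```

Under conventional per-model partitioning, such a router would need one queue per model; sharing the decode module is what collapses them into a single pool.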
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Math | GSM8K | Accuracy | 0.855 | 206 |
| Code | HumanEval | HumanEval Accuracy | 88.4 | 79 |
| Code | HumanEval+ | Accuracy | 79.9 | 34 |
| Math | GSM-PLUS | Score | 64.9 | 22 |
| Function Calling | BFCL Simple Python | Accuracy | 0.935 | 20 |
| Coding | HEval+ | Accuracy | 75 | 12 |
| Coding | HEval | Accuracy | 82.3 | 12 |
| Tool Calling | BFCL Multiple | Accuracy | 92.5 | 12 |
| System Performance Evaluation | Multi-LLM Serving Workload | TTFT (ms) | 97.1 | 6 |