
SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

About

In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: because cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, allowing a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN also inherently facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with accuracy comparable to SUN while preserving the benefits of shared decoding.
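The model-agnostic routing idea can be sketched in a few lines: because the decode module is frozen and shared, any decode worker can serve any model's requests, so placement reduces to plain load balancing. The names below (`DecodeWorker`, `route_decode`) are illustrative, not from the paper, and the least-loaded policy is one simple choice of balancing strategy.

```python
# Minimal sketch of model-agnostic decode routing under SUN's setting.
# Hypothetical names; the real system's scheduler is not specified here.
from dataclasses import dataclass

@dataclass
class DecodeWorker:
    worker_id: int
    active_requests: int = 0  # in-flight decode requests (proxy for load)

def route_decode(workers, request_model):
    # Cross-model sharing: the request's model tag is irrelevant to placement,
    # so simply pick the least-loaded worker to keep decode batches full.
    target = min(workers, key=lambda w: w.active_requests)
    target.active_requests += 1
    return target.worker_id

workers = [DecodeWorker(i) for i in range(3)]
# Requests from different fine-tuned models land on the same shared pool.
assignments = [route_decode(workers, m)
               for m in ["model-A", "model-B", "model-A", "model-C"]]
print(assignments)  # → [0, 1, 2, 0]
```

With per-model partitioning, each of the three models above would need its own decode worker regardless of load; with a shared pool, the same three workers absorb skewed traffic from all models, which is what drives the reported utilization gain.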

Sunghyeon Woo, Ahreum Seo, Jaegwang Lee, Jaeeun Kil, Hanbae Seo, Joonghoon Kim, Baeseong Park, Se Jung Kwon, Dongsoo Lee • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Math | GSM8K | Accuracy | 0.855 | 206 |
| Code | HumanEval | HumanEval Accuracy | 88.4 | 79 |
| Code | HumanEval+ | Accuracy | 79.9 | 34 |
| Math | GSM-PLUS | Score | 64.9 | 22 |
| Function Calling | BFCL Simple Python | Accuracy | 0.935 | 20 |
| Coding | HEval+ | Accuracy | 75 | 12 |
| Coding | HEval | Accuracy | 82.3 | 12 |
| Tool Calling | BFCL Multiple | Accuracy | 92.5 | 12 |
| System Performance Evaluation | Multi-LLM Serving Workload | TTFT (ms) | 97.1 | 6 |
