SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
About
In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: because cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach to enable cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, so a frozen decode module can be shared across models. This design admits a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN also naturally supports low-bit decoding: Quantized SUN (QSUN) achieves a 45% speedup with accuracy comparable to SUN while preserving the benefits of shared decoding.
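Because the frozen decode module is shared across models, decode requests from any model can be placed on any decode worker, so routing reduces to plain load balancing. The sketch below illustrates one such model-agnostic policy (least-loaded worker selection); the class and method names are illustrative assumptions, not SUN's actual implementation.

```python
import heapq

class SharedDecodeRouter:
    """Hypothetical sketch of a model-agnostic decode routing policy.

    Since SUN's shared decode module is identical on every worker, the
    router never looks at which model produced a request's prefill; it
    only balances the number of active decode sequences per worker.
    """

    def __init__(self, num_workers: int):
        # Min-heap of (active_decode_sequences, worker_id).
        self.heap = [(0, w) for w in range(num_workers)]
        heapq.heapify(self.heap)

    def route(self, request_id: str) -> int:
        # Assign the request to the least-loaded shared decode worker.
        load, worker = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + 1, worker))
        return worker

    def complete(self, worker: int) -> None:
        # Decrement a worker's load when one of its sequences finishes.
        self.heap = [(l - 1 if w == worker else l, w) for l, w in self.heap]
        heapq.heapify(self.heap)
```

Under conventional per-model partitioning, such a router would need one queue per model; sharing the decode module is what collapses them into a single pool.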
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Math | GSM8K | Accuracy | 0.855 | 206 |
| Code | HumanEval | HumanEval Accuracy | 88.4 | 79 |
| Code | HumanEval+ | Accuracy | 79.9 | 34 |
| Math | GSM-PLUS | Score | 64.9 | 22 |
| Function Calling | BFCL Simple Python | Accuracy | 0.935 | 20 |
| Coding | HEval+ | Accuracy | 75 | 12 |
| Coding | HEval | Accuracy | 82.3 | 12 |
| Tool Calling | BFCL Multiple | Accuracy | 92.5 | 12 |
| System Performance Evaluation | Multi-LLM Serving Workload | TTFT (ms) | 97.1 | 6 |