
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

About

Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency requirements, and training a separate model for each scenario is costly. Because MoE models already activate only a sparse subset of experts, adjusting the number of activated experts offers a natural way to serve diverse budgets with a single model. Yet we find that activating more experts $k'$ ($> k$) at inference does not yield the expected gains: performance degrades rapidly after only a slight increase, a phenomenon we term the inference-time scaling wall. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to elastically vary the number of activated experts at inference. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE maintains robust performance across inference budgets. Extensive experiments across four MoE architectures (7B to 21B parameters) and nine benchmarks show that EMoE significantly expands the effective scaling range to 2–3× the training-time $k$, while also achieving higher peak performance.
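To make "adjusting the number of activated experts" concrete, here is a minimal sketch of top-$k$ routing in which $k$ is a run-time argument, so the same router and experts can be queried with $k$ or a larger $k'$. It assumes a softmax over router logits with gate weights renormalized over the selected experts (a common convention, but not universal across MoE implementations). The helpers `route_top_k` and `moe_forward` and the toy scalar experts are hypothetical illustrations, not code from the EMoE paper, and the sketch does not implement EMoE's training procedure.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> Tuple[List[int], List[float]]:
    """Select the k highest-probability experts and renormalize their gates.

    Assumption: gates are renormalized over the selected experts so they sum to 1.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError(f"k must be in [1, {len(router_logits)}], got {k}")
    probs = softmax(router_logits)
    # Indices of the k largest probabilities; ties keep the lower index first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return top, [probs[i] / mass for i in top]


def moe_forward(x: float, router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]], k: int) -> float:
    """Hypothetical helper: gate-weighted sum of the k selected experts' outputs."""
    idx, gates = route_top_k(router_logits, k)
    return sum(g * experts[i](x) for i, g in zip(idx, gates))


if __name__ == "__main__":
    # Toy scalar "experts": expert j multiplies its input by (j + 1).
    experts = [lambda x, c=c: c * x for c in (1.0, 2.0, 3.0, 4.0)]
    logits = [0.1, 2.0, 1.0, -1.0]  # router scores for a single token
    for k in (1, 2, 3):  # e.g. training-time k = 2, inference-time k' = 3
        print(k, moe_forward(1.0, logits, experts, k))
```

With renormalization, raising $k$ does more than add experts: it also shrinks the gate weights of the experts that were already selected (here the top expert's gate drops from 1 at $k=1$ to $e/(e+1)$ at $k=2$), so the combined output shifts even for experts the model relied on at its training-time $k$.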

Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Knowledge | MMLU | Accuracy | 72.93 | 307 |
| Reasoning | ARC Easy | Accuracy | 95.94 | 233 |
| Commonsense Reasoning | Wino | Accuracy | 68.51 | 146 |
| Reasoning | ARC-C | Accuracy | 90.85 | 112 |
| Code Generation | HumanE | Accuracy | 75.61 | 82 |
| Open-domain Question Answering | NQ | Accuracy | 27.87 | 74 |
| Question Answering | TriQA | Accuracy | 61.32 | 47 |
| Commonsense Reasoning | HellaS | Accuracy | 84.54 | 44 |
| General Language Understanding | Average | Average Accuracy | 72.93 | 26 |
