EMO: Frustratingly Easy Progressive Training of Extendable MoE

About

Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on the k active experts rather than the total pool of E experts. Yet this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in a scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the performance of a fixed-expert setup in large-scale experiments while improving wall-clock efficiency. It offers a surprisingly simple yet effective path to scalable MoE training, preserving the benefits of large expert pools while reducing both training time and GPU cost.
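
The asymmetry described above can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the class name MoELayer, the layer sizes, and the two-matrix expert layout are illustrative assumptions, and router cost and biases are ignored. It only shows that, for a fixed k, approximate per-token FLOPs stay constant while the stored expert parameters grow linearly with E.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayer:
    """Toy cost model of one MoE feed-forward layer (hypothetical, for illustration)."""

    d_model: int      # hidden size (assumed value, not from the paper)
    d_ff: int         # expert feed-forward width (assumed value)
    num_experts: int  # E: experts that must be held in memory
    top_k: int        # k: experts activated for each token

    def params_per_expert(self) -> int:
        # Assumes each expert is an up-projection plus a down-projection,
        # with biases ignored.
        return 2 * self.d_model * self.d_ff

    def stored_expert_params(self) -> int:
        # Memory footprint of the expert pool scales with E.
        return self.num_experts * self.params_per_expert()

    def active_expert_params(self) -> int:
        # Only k experts take part in the computation for a given token.
        return self.top_k * self.params_per_expert()

    def forward_flops_per_token(self) -> int:
        # Roughly 2 FLOPs (multiply and add) per active weight;
        # the small router cost is ignored here.
        return 2 * self.active_expert_params()


if __name__ == "__main__":
    for num_experts in (8, 16, 32, 64):
        layer = MoELayer(d_model=1024, d_ff=4096, num_experts=num_experts, top_k=2)
        print(
            f"E={num_experts:3d}  "
            f"stored params={layer.stored_expert_params():>12,d}  "
            f"forward FLOPs/token={layer.forward_flops_per_token():>12,d}"
        )
```

In this toy model, growing E from 8 to 64 multiplies the stored expert parameters by eight while the forward FLOPs per token do not change, which is the gap between compute and memory that motivates starting with a smaller pool and expanding it later.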

Linghao Jin, Chufan Shi, Huijuan Wang, Nuan Wen, Zhengzhong Liu, Eric Xing, Xuezhe Ma • 2026

Related benchmarks

Task	Dataset	Metric	Result
Question Answering	ARC-E	Accuracy	75.18	544
Question Answering	ARC-C	Accuracy	42.49	283
Social Commonsense Reasoning	SIQA	Accuracy	42.48	118
Physical Commonsense Reasoning	PIQA	Accuracy	74.81	99
Open-domain Question Answering	TriviaQA	Exact Match (EM)	33.77	88
Open-domain Question Answering	Natural Questions (NQ)	Exact Match (EM)	13.35	82
Pronoun Resolution	WinoGrande	Accuracy	61.88	64
Causal Reasoning	COPA	Accuracy	80	63
Boolean Question Answering	BoolQ	Accuracy	70.52	57
OpenBook Question Answering	OBQA	Accuracy	29.6	32

Showing 10 of 13 rows
