Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
About
Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a MoE generalization that introduces a novel scaling dimension: Virtual Width. MoUE reuses a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing-path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with a lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
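The core idea of reusing one layer-agnostic expert pool across depth can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: every layer routes over the *same* experts, and a per-layer rotation of the candidate ordering stands in for the Staggered Rotational Topology (the exact rotation rule, expert form, and top-k routing here are our assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, L, K = 8, 4, 6, 2  # hidden dim, universal experts, layers, top-k

# One universal, layer-agnostic expert pool shared by all L layers.
expert_w = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(E)]
router_w = rng.standard_normal((D, E)) / np.sqrt(D)  # shared router weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moue_forward(x):
    """Depth-to-virtual-width sketch: each layer routes over the SAME expert
    pool; a per-layer index rotation stands in for staggered expert sharing."""
    for layer in range(L):
        order = np.roll(np.arange(E), layer)      # rotate candidates per layer
        probs = softmax(router_w.T @ x)[order]    # routing scores, rotated
        topk = np.argsort(probs)[-K:]             # top-k experts for this token
        gate = probs[topk] / probs[topk].sum()    # renormalized gate weights
        # Residual update: gated sum of the selected universal experts.
        x = x + sum(g * np.tanh(expert_w[order[i]] @ x)
                    for g, i in zip(gate, topk))
    return x

y = moue_forward(rng.standard_normal(D))
```

With L layers each choosing among E shared experts, the model traverses one of up to E^L routing paths per token while keeping the per-token activation budget fixed at K experts per layer, which is the routing-path explosion the trajectory-state router is meant to tame.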
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 80.3 | 1891 |
| Commonsense Reasoning | WinoGrande | Accuracy | 57.9 | 1085 |
| Code Generation | HumanEval | -- | -- | 1036 |
| Question Answering | ARC Challenge | Accuracy | 66.8 | 906 |
| Question Answering | ARC Easy | Accuracy | 79.9 | 597 |
| Knowledge | MMLU | Accuracy | 50.4 | 136 |
| Question Answering | TriviaQA | Accuracy | 51.2 | 112 |
| Question Answering | Natural Questions (NQ) | Accuracy | 21.4 | 48 |
| General Language Understanding | NLP Evaluation Suite (SciQ, PIQA, WG, ARC, HellaSwag, LogiQA, BoolQ, LAMBADA) | SciQ Accuracy | 58.3 | 14 |