$\phi$-Balancing for Mixture-of-Experts Training

About

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
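The abstract does not spell out the update rule, so the following is only a minimal sketch of what an EMA-based routing adjustment of this kind could look like, not the authors' algorithm. It assumes the potential is the negative entropy $\phi(p) = \sum_i p_i \log p_i$, that the expected routing distribution is tracked with an exponential moving average of per-batch expert loads, and that the gradient of $\phi$ at that average is subtracted from the router logits before top-k selection. The names `grad_neg_entropy`, `route_top_k`, `batch_load`, and `simulate`, along with the parameters `beta` and `eta` and the synthetic skewed router, are all hypothetical.

```python
import numpy as np

def grad_neg_entropy(p, eps=1e-8):
    # Gradient of the assumed potential phi(p) = sum_i p_i * log(p_i).
    return np.log(p + eps) + 1.0

def route_top_k(logits, bias, k):
    # Subtract the per-expert bias, then pick the k highest-scoring experts per token.
    adjusted = logits - bias
    return np.argsort(-adjusted, axis=-1)[:, :k]

def batch_load(assignments, num_experts):
    # Fraction of token-to-expert assignments received by each expert in this batch.
    counts = np.bincount(assignments.ravel(), minlength=num_experts)
    return counts / counts.sum()

def simulate(num_experts=8, k=2, steps=300, batch=512, beta=0.1, eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic router (an assumption): logits are skewed toward high-index experts.
    skew = np.linspace(0.0, 2.0, num_experts)
    # EMA estimate of the expected routing distribution, initialized to uniform.
    p_ema = np.full(num_experts, 1.0 / num_experts)
    for _ in range(steps):
        logits = rng.normal(size=(batch, num_experts)) + skew
        # Overloaded experts (large p_ema) get a larger penalty on their logits.
        bias = eta * grad_neg_entropy(p_ema)
        chosen = route_top_k(logits, bias, k)
        p_hat = batch_load(chosen, num_experts)
        # Smooth the noisy mini-batch statistic instead of reacting to it directly.
        p_ema = (1.0 - beta) * p_ema + beta * p_hat
    return p_ema

if __name__ == "__main__":
    print("with adjustment:   ", np.round(simulate(eta=1.0), 3))
    print("without adjustment:", np.round(simulate(eta=0.0), 3))
```

Setting `eta=0.0` disables the adjustment, which gives a baseline for comparison with the adjusted run. In this toy setup the adjustment flattens the load distribution but does not drive it exactly to uniform; the paper's mirror-descent update may behave differently.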

Lizhang Chen, Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, Qiang Liu • 2026

Related benchmarks

Task	Dataset	Result
Reasoning	BBH	Accuracy 48.82	770
Mathematical Reasoning	GSM8K	Accuracy (Acc) 69.05	352
Code Generation	HumanEval	Accuracy 33.55	224
Math Word Problem Solving	GSM8K	Accuracy 92.92	158
Graduate-level Science Reasoning	GPQA	Accuracy 82.34	138
Mathematical Reasoning	MATH500	Accuracy 66.2	124
Graduate-level Science Question Answering	GPQA	Accuracy (GPQA) 27.88	72
Multi-domain reasoning	BBH	Accuracy 85.74	39
General Evaluation	LiveBench	Accuracy 46.83	15
Multi-task Evaluation	Mixed Benchmark Average	Average Accuracy 40.86	6
