$\phi$-Balancing for Mixture-of-Experts Training

About

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
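As a rough illustration of what an EMA-based routing adjustment of this kind could look like, the sketch below keeps a running estimate of the expert load and shifts router scores against the gradient of a potential evaluated at that estimate. The abstract does not give the update rule, so everything specific here is an assumption: the negative-entropy potential, the top-k selection, the additive bias on logits, and the hypothetical names `grad_neg_entropy`, `phi_balance_step`, `ema_decay`, and `step_size`. It is not the authors' algorithm.

```python
import numpy as np

def grad_neg_entropy(p, eps=1e-8):
    """Gradient of phi(p) = sum_i p_i log p_i (an assumed choice of potential)."""
    return np.log(p + eps) + 1.0

def phi_balance_step(logits, load_ema, ema_decay=0.9, step_size=1.0, top_k=1):
    """One routing step with a potential-based bias (illustrative only).

    logits:   (tokens, experts) raw router scores
    load_ema: (experts,) running estimate of the expected routing distribution
    """
    n_tokens, n_experts = logits.shape
    # Bias each expert against the potential's gradient at the EMA load:
    # experts with higher estimated load receive a lower score.
    bias = -step_size * grad_neg_entropy(load_ema)
    biased = logits + bias
    # Top-k selection uses the biased scores.
    chosen = np.argsort(-biased, axis=1)[:, :top_k]
    # Fraction of assignments per expert in this mini-batch.
    counts = np.bincount(chosen.ravel(), minlength=n_experts)
    batch_load = counts / counts.sum()
    # The EMA smooths noisy mini-batch statistics across steps.
    new_ema = ema_decay * load_ema + (1.0 - ema_decay) * batch_load
    return chosen, new_ema

def run(step_size, steps=200, n_experts=4, seed=0):
    """Simulate a router that is skewed toward expert 0 (synthetic data)."""
    rng = np.random.default_rng(seed)
    skew = np.array([2.0] + [0.0] * (n_experts - 1))
    load = np.full(n_experts, 1.0 / n_experts)
    for _ in range(steps):
        logits = rng.normal(size=(256, n_experts)) + skew
        chosen, load = phi_balance_step(logits, load, step_size=step_size)
    return chosen, load

_, load_unbiased = run(step_size=0.0)   # no adjustment
chosen, load_biased = run(step_size=1.0)
print("expert-0 load without bias:", round(float(load_unbiased[0]), 3))
print("expert-0 load with bias:   ", round(float(load_biased[0]), 3))
```

In this toy setting the bias acts like a proportional correction, so it reduces the skew toward expert 0 without necessarily reaching a uniform load. A dual-variable update of the kind the abstract alludes to would presumably accumulate corrections over time, which this sketch does not attempt.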

Lizhang Chen, Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, Qiang Liu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reasoning | BBH | Accuracy 48.82 | 726 |
| Mathematical Reasoning | GSM8K | Accuracy (Acc) 69.05 | 337 |
| Code Generation | HumanEval | Accuracy 33.55 | 217 |
| Math Word Problem Solving | GSM8K | Accuracy 92.92 | 158 |
| Graduate-level Science Reasoning | GPQA | Accuracy 82.34 | 121 |
| Mathematical Reasoning | MATH500 | Accuracy 66.2 | 76 |
| Graduate-level Science Question Answering | GPQA | Accuracy (GPQA) 27.88 | 72 |
| General Evaluation | LiveBench | Accuracy 46.83 | 15 |
| Multi-domain reasoning | BBH | Accuracy 85.74 | 9 |
| Multi-task Evaluation | Mixed Benchmark Average | Average Accuracy 40.86 | 6 |