$\phi$-Balancing for Mixture-of-Experts Training
About
Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
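To make the idea concrete, the following is a minimal sketch of what an EMA-based routing adjustment driven by a convex potential could look like. It is not the paper's algorithm: the abstract gives no update rule, so every specific here is an assumption. The sketch takes $\phi(p) = \sum_i p_i \log p_i$ (negative entropy) as the potential and uses the standard Fenchel identity $\phi(p) = \max_\lambda \{\langle \lambda, p \rangle - \phi^*(\lambda)\}$, whose maximizer is $\lambda = \nabla\phi(p)$. It then subtracts a scaled $\nabla\phi$ of an EMA load estimate from the router logits before top-$k$ selection. The function names, `step_size`, `decay`, and the synthetic skewed router are all hypothetical.

```python
import numpy as np

def grad_neg_entropy(p, eps=1e-8):
    """Gradient of the assumed potential phi(p) = sum_i p_i * log(p_i)."""
    return np.log(p + eps) + 1.0

def route_top_k(logits, bias, k):
    """Select the k experts with the highest bias-adjusted logits per token."""
    adjusted = logits + bias
    return np.argsort(-adjusted, axis=-1)[:, :k]

def update_ema_load(ema_load, chosen, num_experts, decay=0.9):
    """Blend this batch's expert load fractions into the running estimate."""
    counts = np.bincount(chosen.ravel(), minlength=num_experts)
    batch_load = counts / counts.sum()
    return decay * ema_load + (1.0 - decay) * batch_load

def simulate(num_experts=8, k=2, tokens=512, steps=200, step_size=1.0, seed=0):
    """Run a toy router whose raw logits favor high-index experts."""
    rng = np.random.default_rng(seed)
    skew = np.linspace(0.0, 2.0, num_experts)  # synthetic imbalance (assumption)
    ema_load = np.full(num_experts, 1.0 / num_experts)
    for _ in range(steps):
        logits = rng.normal(size=(tokens, num_experts)) + skew
        # Dual-variable-style bias: overloaded experts get a lower score.
        bias = -step_size * grad_neg_entropy(ema_load)
        chosen = route_top_k(logits, bias, k)
        ema_load = update_ema_load(ema_load, chosen, num_experts)
    return ema_load

if __name__ == "__main__":
    print("adjusted  :", np.round(simulate(step_size=1.0), 3))
    print("unadjusted:", np.round(simulate(step_size=0.0), 3))
```

In this toy setting, `step_size=0.0` disables the adjustment and leaves the load skewed toward the favored experts, while a positive step size pulls the EMA load closer to uniform. The only per-step state is one vector of length `num_experts`, which is consistent with the negligible overhead the abstract describes.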
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | BBH | Accuracy | 48.82 | 726 |
| Mathematical Reasoning | GSM8K | Accuracy (Acc) | 69.05 | 337 |
| Code Generation | HumanEval | Accuracy | 33.55 | 217 |
| Math Word Problem Solving | GSM8K | Accuracy | 92.92 | 158 |
| Graduate-level Science Reasoning | GPQA | Accuracy | 82.34 | 121 |
| Mathematical Reasoning | MATH500 | Accuracy | 66.2 | 76 |
| Graduate-level Science Question Answering | GPQA | Accuracy (GPQA) | 27.88 | 72 |
| General Evaluation | LiveBench | Accuracy | 46.83 | 15 |
| Multi-domain reasoning | BBH | Accuracy | 85.74 | 9 |
| Multi-task Evaluation | Mixed Benchmark Average | Average Accuracy | 40.86 | 6 |