Mixture-of-Experts with Expert Choice Routing
About
Sparsely-activated Mixture-of-Experts (MoE) models allow the number of parameters to grow substantially while keeping the amount of computation for a given token or sample unchanged. However, a poor expert routing strategy (e.g., one resulting in load imbalance) can cause certain experts to be under-trained, leaving them under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method: instead of letting tokens select the top-k experts, experts select the top-k tokens. As a result, each token can be routed to a variable number of experts, and each expert has a fixed bucket size. Using the same computational resources, we systematically study pre-training speedups over the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance when fine-tuning on 11 selected tasks from the GLUE and SuperGLUE benchmarks. For a smaller activation cost, it outperforms the T5 dense model on 7 of the 11 tasks.
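The core routing idea above (experts pick tokens, rather than tokens picking experts) can be sketched as follows. This is a minimal illustrative NumPy version, not the paper's implementation; the function name, the capacity-factor parameter, and the dictionary return format are assumptions for the sake of the example.

```python
import numpy as np

def expert_choice_routing(token_scores, capacity_factor=2.0):
    """Sketch of expert-choice routing: each expert selects its top-k tokens.

    token_scores: (n_tokens, n_experts) router logits.
    Returns a dict mapping expert index -> (token indices, gating weights).
    Illustrative only; names and shapes are assumptions, not the paper's API.
    """
    n_tokens, n_experts = token_scores.shape
    # Fixed per-expert bucket size: k = n * c / e, with c the capacity factor.
    k = int(n_tokens * capacity_factor / n_experts)
    # Softmax over experts gives each token's affinity to each expert.
    exp = np.exp(token_scores - token_scores.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    routing = {}
    for e in range(n_experts):
        # Each expert independently takes its k highest-affinity tokens,
        # so load is balanced by construction, and a given token may be
        # chosen by zero, one, or several experts.
        top = np.argsort(-probs[:, e])[:k]
        routing[e] = (top, probs[top, e])
    return routing
```

Note how the fixed bucket size guarantees load balance without an auxiliary loss, while the variable number of experts per token lets more "important" tokens receive more computation.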
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 29.14 | 1460 |
| Question Answering | ARC Challenge | Accuracy | 18.86 | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 61.92 | 647 |
| Question Answering | ARC-E | Accuracy | 42.97 | 242 |
| Reading Comprehension | BoolQ | Accuracy | 60.21 | 219 |
| Language Modeling | LAMBADA | Accuracy | 29.26 | 183 |
| Reading Comprehension | RACE | Accuracy | 27.37 | 151 |
| Semantic Textual Similarity | STS 2012-2016 (test) | STS-12 Score | 80.71 | 57 |
| Semantic Textual Similarity | CDSC-R (val) | Spearman Correlation | 86.29 | 22 |
| Semantic Textual Similarity | CDSC-R (test) | Spearman Correlation | 0.8471 | 22 |