Mixture-of-Experts with Expert Choice Routing

About

Sparsely-activated mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts select the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources as the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance in fine-tuning on 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
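
The routing idea described above (experts pick tokens, rather than tokens picking experts) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `scores` matrix of token-to-expert affinities, and the toy values are all hypothetical, and the affinities are assumed to be already computed by some router.

```python
def expert_choice_route(scores, bucket_size):
    """Each expert selects its top-`bucket_size` tokens.

    scores[t][e] is a hypothetical affinity of token t for expert e
    (assumed to be given; how it is produced is out of scope here).
    Returns one list of token indices per expert, best token first.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    assignments = []
    for e in range(n_experts):
        # Rank all tokens by their affinity for this expert.
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        # Fixed bucket size: every expert takes the same number of tokens.
        assignments.append(ranked[:bucket_size])
    return assignments


def experts_per_token(assignments, n_tokens):
    """Count how many experts selected each token (may vary per token)."""
    counts = [0] * n_tokens
    for bucket in assignments:
        for t in bucket:
            counts[t] += 1
    return counts


# Toy example: 4 tokens, 3 experts, each expert takes 2 tokens.
scores = [
    [0.5, 0.4, 0.1],  # token 0
    [0.1, 0.1, 0.8],  # token 1
    [0.4, 0.5, 0.1],  # token 2
    [0.3, 0.3, 0.4],  # token 3
]
assignments = expert_choice_route(scores, bucket_size=2)
counts = experts_per_token(assignments, n_tokens=len(scores))
print(assignments)  # tokens chosen by each expert
print(counts)       # number of experts each token was routed to
```

In this toy input every expert ends up with exactly two tokens, while tokens 0 and 2 are each processed by two experts and tokens 1 and 3 by one. That matches the two properties the abstract names: a fixed bucket size per expert and a variable number of experts per token.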

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, James Laudon • 2022

Related benchmarks

Task                         Dataset                                             Result                         Rank
Commonsense Reasoning        HellaSwag                                           Accuracy 29.14                 1460
Question Answering           ARC Challenge                                       Accuracy 18.86                 749
Commonsense Reasoning        PIQA                                                Accuracy 61.92                 647
Question Answering           ARC-E                                               Accuracy 42.97                 242
Reading Comprehension        BoolQ                                               Accuracy 60.21                 219
Language Modeling            LAMBADA                                             Accuracy 29.26                 183
Reading Comprehension        RACE                                                Accuracy 27.37                 151
Semantic Textual Similarity  STS (Semantic Textual Similarity) 2012-2016 (test)  STS-12 Score 80.71             57
Semantic Textual Similarity  CDSC-R (val)                                        Spearman Correlation 86.29     22
Semantic Textual Similarity  CDSC-R (test)                                       Spearman's Correlation 0.8471  22
Showing 10 of 17 rows
