Mixture of A Million Experts
About
The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
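The retrieval step the abstract refers to can be illustrated with a small NumPy sketch. It assumes the usual product-key construction: the query is split into two halves, each half is scored against its own set of `n` sub-keys, and the top-`k` of the `n * n` combined keys are recovered from only `k * k` candidates. The function names (`product_key_topk`, `peer_forward`), the single-neuron expert form (one `down` and one `up` vector per expert), the ReLU activation, and the softmax routing weights are illustrative assumptions, not details stated in the abstract.

```python
import numpy as np


def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    """Return (expert_ids, scores) of the top-k product keys for one query.

    The implicit key set is the Cartesian product of two sub-key sets, so
    n * n keys are searched with only 2 * n dot products.
    """
    half = query.shape[0] // 2
    s1 = sub_keys_1 @ query[:half]   # scores of the first half, shape (n,)
    s2 = sub_keys_2 @ query[half:]   # scores of the second half, shape (n,)
    top1 = np.argsort(-s1)[:k]       # best k sub-keys on each side
    top2 = np.argsort(-s2)[:k]
    # The score of a product key is the sum of its two sub-key scores, so
    # the global top-k must lie among these k * k candidates.
    cand = s1[top1][:, None] + s2[top2][None, :]
    flat = np.argsort(-cand.ravel())[:k]
    i, j = np.unravel_index(flat, cand.shape)
    n = sub_keys_2.shape[0]
    return top1[i] * n + top2[j], cand[i, j]


def peer_forward(x, query_proj, sub_keys_1, sub_keys_2, down, up, k):
    """Sparse layer output for one token (hypothetical single-neuron experts)."""
    ids, scores = product_key_topk(query_proj @ x, sub_keys_1, sub_keys_2, k)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over retrieved experts
    hidden = np.maximum(down[ids] @ x, 0.0)    # one hidden unit per expert
    return (weights * hidden) @ up[ids]        # weighted sum of expert outputs


# Toy sizes: 32 * 32 = 1024 experts, 4 of them active per token.
rng = np.random.default_rng(0)
n, d_model, d_key, k = 32, 16, 8, 4
x = rng.normal(size=d_model)
query_proj = rng.normal(size=(d_key, d_model))
sub_keys_1 = rng.normal(size=(n, d_key // 2))
sub_keys_2 = rng.normal(size=(n, d_key // 2))
down = rng.normal(size=(n * n, d_model))
up = rng.normal(size=(n * n, d_model))
y = peer_forward(x, query_proj, sub_keys_1, sub_keys_2, down, up, k)
```

With `n = 1024` sub-keys per side the same code would address over a million experts while scoring only 2048 sub-keys per query, which is the property that lets expert count grow without a matching growth in retrieval cost.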
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 56.3 | 1460 |
| Commonsense Reasoning | WinoGrande | Accuracy | 59.4 | 776 |
| Language Understanding | MMLU | Accuracy | 37.4 | 756 |
| Commonsense Reasoning | PIQA | Accuracy | 75.9 | 647 |
| Question Answering | OBQA | Accuracy | 39.1 | 276 |
| Question Answering | ARC | Accuracy | 57.4 | 154 |
| Question Answering | TriviaQA | Accuracy | 16.9 | 85 |