
Mixture of A Million Experts

About

The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
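The core retrieval idea in the abstract can be illustrated with a short sketch. With product keys, each of the N = n × n experts is indexed by a pair of sub-keys, so the top-k experts can be found by scoring each half of the query against only n sub-keys instead of scanning all N. The sketch below is a minimal NumPy illustration of this product-key retrieval step, not the paper's implementation; all names and shapes are illustrative assumptions.

```python
import numpy as np

def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    """Retrieve top-k of N = n * n experts via product keys (illustrative sketch).

    The query is split into two halves; each half is scored against its own
    table of n sub-keys. Since a candidate expert (i, j) scores
    s1[i] + s2[j], the global top-k must lie in the Cartesian product of
    the per-half top-k sets, so only k * k candidates need ranking.
    """
    d = query.shape[0] // 2
    s1 = sub_keys_1 @ query[:d]          # (n,) scores for the first half
    s2 = sub_keys_2 @ query[d:]          # (n,) scores for the second half
    top1 = np.argsort(s1)[-k:]           # top-k sub-key indices per half
    top2 = np.argsort(s2)[-k:]
    # Combined score of candidate expert (i, j) is s1[i] + s2[j]
    cand = s1[top1][:, None] + s2[top2][None, :]        # (k, k)
    flat = np.argsort(cand, axis=None)[-k:]
    rows, cols = np.unravel_index(flat, (k, k))
    n = sub_keys_2.shape[0]
    expert_ids = top1[rows] * n + top2[cols]            # index into n * n experts
    return expert_ids, cand[rows, cols]
```

With n = 1024 sub-keys per table this indexes 1024² (over a million) experts while scoring only 2 × 1024 keys plus k² candidates per token, which is what makes such fine granularity tractable. The decomposition is exact: the retrieved set matches a brute-force scan over all N experts.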

Xu Owen He • 2024

Related benchmarks

Task                  | Dataset    | Metric   | Result | Rank
----------------------|------------|----------|--------|-----
Commonsense Reasoning | HellaSwag  | Accuracy | 56.3   | 1460
Commonsense Reasoning | WinoGrande | Accuracy | 59.4   | 776
Language Understanding| MMLU       | Accuracy | 37.4   | 756
Commonsense Reasoning | PIQA       | Accuracy | 75.9   | 647
Question Answering    | OBQA       | Accuracy | 39.1   | 276
Question Answering    | ARC        | Accuracy | 57.4   | 154
Question Answering    | TriviaQA   | Accuracy | 16.9   | 85
