ST-MoE: Designing Stable and Transferable Sparse Expert Models

About

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus• 2022

Related benchmarks

Task	Dataset	Result
Image Classification	Oxford-IIIT Pets	Accuracy94.21	378
Summarization	XSum (test)	ROUGE-227.1	276
Summarization	Xsum	ROUGE-227.1	108
Natural Language Understanding	SuperGLUE (dev)	Average Score93.2	91
Natural Language Understanding	SuperGLUE	SGLUE Score91.2	84
Text Summarization	CNN/Daily Mail (test)	ROUGE-220.7	77
Natural Language Understanding	SuperGLUE (test)	BoolQ Accuracy92.4	74
Language Modeling	OpenWebText (test)	Average Perplexity2.981	31
Question Answering	TeleQuAD Non-IID	BERTScore F168.23	25
Question Answering	TeleQuAD (IID)	BERTScore F165.41	25

Showing 10 of 12 rows

Other info

Code

Follow for update

@wizwand_team Discord