
Upcycling Large Language Models into Mixture of Experts

About

Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel "virtual group" initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over the topK-then-softmax approach, and that higher-granularity MoEs can improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuously trained model achieved 65.3% on MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices for effectively leveraging upcycling to build MoE language models. Code is available.

Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro • 2024
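
To make the two routing variants contrasted in the abstract concrete, here is a minimal PyTorch sketch of the two gate computations, plus the plain weight-replication step that upcycling starts from. The function names and tensor shapes are illustrative, not the paper's actual code, and the fine-grained "virtual group" initialization and weight scaling proposed in the paper are not reproduced here.

import torch
import torch.nn.functional as F

def softmax_then_topk(router_logits: torch.Tensor, k: int):
    # Softmax over all expert logits first, then keep the top-k.
    # Gate magnitudes stay relative to the full expert distribution,
    # so even top-1 routing yields an informative (non-unit) gate.
    probs = F.softmax(router_logits, dim=-1)
    gates, expert_ids = torch.topk(probs, k, dim=-1)
    return gates, expert_ids

def topk_then_softmax(router_logits: torch.Tensor, k: int):
    # Keep the top-k logits first, then softmax over only those k.
    top_vals, expert_ids = torch.topk(router_logits, k, dim=-1)
    gates = F.softmax(top_vals, dim=-1)
    return gates, expert_ids

def upcycle_dense_ffn(dense_ffn_weight: torch.Tensor, num_experts: int):
    # Plain upcycling initialization: replicate the trained dense FFN
    # weight into every expert; the router itself starts from scratch.
    return dense_ffn_weight.unsqueeze(0).repeat(num_experts, 1, 1)

# Toy example: 4 tokens routed over 8 experts with top-2 routing.
logits = torch.randn(4, 8)
gates, expert_ids = softmax_then_topk(logits, k=2)
experts = upcycle_dense_ffn(torch.randn(4096, 1024), num_experts=8)

The contrast is starkest with top-1 routing: topK-then-softmax always emits a gate of exactly 1.0, while softmax-then-topK preserves the router's relative confidence, consistent with the abstract's finding that the latter performs better.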

Related benchmarks

Task                                 Dataset                Metric             Result  Rank
Commonsense Reasoning                WinoGrande             --                 --      1085
Multitask Language Understanding     MMLU                   Accuracy           69.81   413
Commonsense Reasoning                HellaSwag              Accuracy           79.05   350
Science Question Answering           ARC Challenge          Accuracy           59.81   342
Logical Reasoning                    BBH                    Accuracy           66.67   201
Graduate-level Question Answering    GPQA                   Accuracy           31.03   184
Science Question Answering           ARC Easy               Accuracy           84.97   155
General Evaluation                   AGIEval                Accuracy           49.36   29
Code Generation                      MBPP                   Performance Score  53.2    28
Aggregate General Language Modeling  Average 10 Benchmarks  Average Score      65.18   21
