
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

About

Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data hunger and instability problems. Motivated by this limitation, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of the original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and the additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models maintain language abilities and route input tokens to specific experts, with only part of the parameters activated. Empirically, after training on 200B tokens, our LLaMA-MoE-3.5B models significantly outperform dense models that contain a similar number of activated parameters. The source code and models are available at https://github.com/pjlab-sys4nlp/llama-moe .
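The two stages above can be illustrated with a minimal, dependency-free sketch. The first function partitions the indices of an FFN's intermediate neurons into equal-sized experts (a random split is shown here; the paper explores several construction schemes). The second implements standard top-k gate routing with renormalized softmax weights; the exact routing configuration used by LLaMA-MoE may differ, so the values of `k` and the gate logits here are illustrative assumptions. The FFN intermediate size 11008 matches LLaMA-2 7B.

```python
import math
import random

def split_ffn_into_experts(ffn_hidden_size, num_experts, seed=0):
    """Randomly partition FFN intermediate-neuron indices into
    equal-sized, non-overlapping experts (one construction scheme)."""
    assert ffn_hidden_size % num_experts == 0
    rng = random.Random(seed)
    indices = list(range(ffn_hidden_size))
    rng.shuffle(indices)
    size = ffn_hidden_size // num_experts
    return [sorted(indices[i * size:(i + 1) * size]) for i in range(num_experts)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(gate_logits, k=2):
    """Select the top-k experts for a token and renormalize
    their gate weights so they sum to 1."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i],
                    reverse=True)
    chosen = ranked[:k]
    probs = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, probs))

# LLaMA-2 7B uses an FFN intermediate size of 11008; splitting into
# 8 experts gives 1376 neurons per expert.
experts = split_ffn_into_experts(ffn_hidden_size=11008, num_experts=8)
print([len(e) for e in experts])  # [1376, 1376, ..., 1376]

# Hypothetical gate logits for one token; experts 1 and 3 win.
print(route_top_k([0.1, 2.0, -0.5, 1.2], k=2))
```

Because only the selected experts' neurons run for each token, the activated parameter count stays at roughly a top-k fraction of the original dense FFN, which is how the 7B dense model yields a 3.5B-activated MoE.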

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, Yu Cheng• 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Modeling | WikiText2 | Perplexity | 18.6 | 2839 |
| Commonsense Reasoning | WinoGrande | Accuracy | 65.5 | 1085 |
| Question Answering | ARC Challenge | Accuracy | 40.9 | 906 |
| Multi-task Language Understanding | MMLU | Accuracy | 26.8 | 876 |
| Commonsense Reasoning | PIQA | Accuracy | 77.5 | 751 |
| Code Generation | HumanEval (test) | -- | -- | 506 |
| Multi-task Language Understanding | MMLU | Accuracy | 35.79 | 413 |
| Question Answering | ARC Easy | Normalized Acc | 66.8 | 389 |
| Language Modeling | WikiText2 v1 (test) | -- | -- | 383 |
| Commonsense Reasoning | HellaSwag | Accuracy | 47.04 | 350 |

Showing 10 of 43 rows.
