
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

About

Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data hunger and instability. Motivated by this limitation, we investigate building MoE models from existing dense large language models. Specifically, starting from the well-known LLaMA-2 7B model, we obtain an MoE model in two stages: (1) Expert Construction, which partitions the parameters of the original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and the additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models maintain language abilities and route input tokens to specific experts with only part of the parameters activated. Empirically, after training on 200B tokens, the LLaMA-MoE-3.5B models significantly outperform dense models with a similar number of activated parameters. The source code and models are available at https://github.com/pjlab-sys4nlp/llama-moe .
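The two stages described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the dimensions are tiny hypothetical values (LLaMA-2 7B actually uses SwiGLU FFNs with much larger sizes), a plain ReLU stands in for the real activation, and the gate weights shown here would be learned during continual pre-training rather than left random.

```python
import numpy as np

# Hypothetical tiny sizes for illustration only.
d_model, d_ffn, n_experts, top_k = 8, 16, 4, 2
rng = np.random.default_rng(0)

# Dense FFN projections (stand-ins for a trained model's weights).
W_up = rng.standard_normal((d_ffn, d_model))
W_down = rng.standard_normal((d_model, d_ffn))

# (1) Expert Construction: partition the FFN's intermediate neurons
# into disjoint groups; each expert keeps one slice of W_up / W_down.
groups = np.array_split(np.arange(d_ffn), n_experts)
experts = [(W_up[idx], W_down[:, idx]) for idx in groups]

# Newly initialized gate network, trained during continual pre-training.
W_gate = rng.standard_normal((n_experts, d_model))

def moe_ffn(x):
    """Route token x to its top-k experts; only their parameters run."""
    logits = W_gate @ x
    top = np.argsort(logits)[-top_k:]            # selected experts
    probs = np.exp(logits[top])
    probs /= probs.sum()                          # softmax over top-k
    out = np.zeros(d_model)
    for p, e in zip(probs, top):
        up, down = experts[e]
        h = np.maximum(up @ x, 0.0)               # ReLU stand-in for SwiGLU
        out += p * (down @ h)
    return out

x = rng.standard_normal(d_model)
y = moe_ffn(x)
```

With `top_k = 2` of 4 experts, each token activates only half of the FFN parameters per layer, which is the source of the "3.5B activated parameters" figure for a model converted from a 7B dense backbone.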

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, Yu Cheng • 2024

Related benchmarks

Task | Dataset | Result | Rank
Language Modeling | WikiText2 | Perplexity 18.6 | 1875
Multi-task Language Understanding | MMLU | Accuracy 26.8 | 842
Commonsense Reasoning | WinoGrande | Accuracy 65.5 | 776
Question Answering | ARC Challenge | Accuracy 40.9 | 749
Commonsense Reasoning | PIQA | Accuracy 77.5 | 647
Code Generation | HumanEval (test) | -- | 444
Question Answering | ARC Easy | Normalized Acc 66.8 | 385
Language Modeling | WikiText2 v1 (test) | -- | 341
Physical Interaction Question Answering | PIQA | Accuracy 77.9 | 323
Question Answering | SciQ | Accuracy 89.9 | 226
(Showing 10 of 33 rows)
