Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
About
The Mixture of Experts (MoE) architecture significantly reduces training and inference cost compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, training progresses more slowly than when training from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling, a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves performance comparable to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints, and logs, are publicly available to promote reproducibility and future research on MoE.
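The core idea described above, copying a pre-trained dense layer into each expert and then re-initializing a fraction of its weights, can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name `drop_upcycle`, the column-wise dropping, and the `drop_ratio` parameter are assumptions for demonstration; re-initialized weights are drawn to match the scale of the pre-trained matrix.

```python
import numpy as np

def drop_upcycle(dense_w, num_experts, drop_ratio, rng):
    """Hypothetical sketch of Drop-Upcycling for one FFN weight matrix:
    copy the dense weights into each expert, then re-initialize a randomly
    chosen fraction (drop_ratio) of the columns so each expert starts from
    the dense model but can specialize during training."""
    d_out, d_in = dense_w.shape
    std = dense_w.std()  # match the scale of the pre-trained weights
    experts = []
    for _ in range(num_experts):
        w = dense_w.copy()
        n_drop = int(drop_ratio * d_in)
        # pick a different random subset of columns per expert
        cols = rng.choice(d_in, size=n_drop, replace=False)
        w[:, cols] = rng.normal(0.0, std, size=(d_out, n_drop))
        experts.append(w)
    return experts

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.02, size=(32, 64))
experts = drop_upcycle(dense, num_experts=4, drop_ratio=0.5, rng=rng)
```

With `drop_ratio=0.5`, each expert keeps half of the dense columns verbatim and re-initializes the rest, so the experts share the dense model's knowledge while diverging enough to specialize.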
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | WinoGrande | -- | 1085 |
| Multitask Language Understanding | MMLU | Accuracy: 25.93 | 413 |
| Commonsense Reasoning | HellaSwag | Accuracy: 34.07 | 350 |
| Science Question Answering | ARC Challenge | Accuracy: 25.51 | 342 |
| Logical Reasoning | BBH | Accuracy: 17.98 | 201 |
| Graduate-level Question Answering | GPQA | Accuracy: 25 | 184 |
| Science Question Answering | ARC Easy | Accuracy: 51.3 | 155 |
| Image Classification | VTAB | -- | 103 |
| Image Classification | ImageNet-1K | Accuracy: 73.1 | 92 |
| General Evaluation | AGIEval | Accuracy: 25.86 | 29 |