Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
About
The Mixture of Experts (MoE) architecture significantly reduces training and inference cost compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, training progresses more slowly than when training from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling, a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves performance comparable to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints, and logs, are publicly available to promote reproducibility and future research on MoE.
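The core idea described above, copying a pre-trained dense layer into each expert and then re-initializing a fraction of its weights, can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name `drop_upcycle`, the column-wise dropping, and the `drop_ratio` parameter are assumptions for demonstration; re-initialized weights are drawn to match the scale of the pre-trained matrix.

```python
import numpy as np

def drop_upcycle(dense_w, num_experts, drop_ratio, rng):
    """Hypothetical sketch of Drop-Upcycling for one FFN weight matrix:
    copy the dense weights into each expert, then re-initialize a randomly
    chosen fraction (drop_ratio) of the columns so each expert starts from
    the dense model but can specialize during training."""
    d_out, d_in = dense_w.shape
    std = dense_w.std()  # match the scale of the pre-trained weights
    experts = []
    for _ in range(num_experts):
        w = dense_w.copy()
        n_drop = int(drop_ratio * d_in)
        # pick a different random subset of columns per expert
        cols = rng.choice(d_in, size=n_drop, replace=False)
        w[:, cols] = rng.normal(0.0, std, size=(d_out, n_drop))
        experts.append(w)
    return experts

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.02, size=(32, 64))
experts = drop_upcycle(dense, num_experts=4, drop_ratio=0.5, rng=rng)
```

With `drop_ratio=0.5`, each expert keeps half of the dense columns verbatim and re-initializes the rest, so the experts share the dense model's knowledge while diverging enough to specialize.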
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | WinoGrande | -- | 1085 |
| Multitask Language Understanding | MMLU | Accuracy: 25.93 | 413 |
| Commonsense Reasoning | HellaSwag | Accuracy: 34.07 | 350 |
| Science Question Answering | ARC Challenge | Accuracy: 25.51 | 342 |
| Logical Reasoning | BBH | Accuracy: 17.98 | 201 |
| Graduate-level Question Answering | GPQA | Accuracy: 25 | 184 |
| Science Question Answering | ARC Easy | Accuracy: 51.3 | 155 |
| Image Classification | VTAB | -- | 103 |
| Image Classification | ImageNet-1K | Accuracy: 73.1 | 92 |
| General Evaluation | AGIEval | Accuracy: 25.86 | 29 |