
Layer-adaptive Expert Pruning for Pre-Training of Mixture-of-Experts Large Language Models

About

Although Mixture-of-Experts (MoE) Large Language Models (LLMs) deliver superior accuracy with a reduced number of active parameters, their pre-training remains a significant computational bottleneck due to underutilized experts and limited training efficiency. This work introduces a Layer-Adaptive Expert Pruning (LAEP) algorithm designed for the pre-training stage of MoE LLMs. In contrast to previous expert pruning approaches, which operate primarily in the post-training phase, the proposed algorithm improves training efficiency by selectively pruning underutilized experts and reorganizing the remaining experts across computing devices according to token distribution statistics. Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. In particular, when pre-training the Yuan3.0-1T Base model (originally 1515B parameters) from scratch, LAEP achieves a 48.3% improvement in training efficiency alongside a 33.3% parameter reduction, while still delivering strong performance across multiple domains.
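The abstract does not spell out implementation details, but the two core steps it names, pruning experts whose token traffic is low relative to their layer and repacking the survivors across devices, can be illustrated in a short sketch. The NumPy code below is a minimal toy version under stated assumptions: the array shapes, the thresholds (`rel_threshold`, `keep_ratio_floor`), and the greedy device-packing heuristic are all illustrative choices, not the paper's actual algorithm or API.

```python
# Toy sketch of layer-adaptive expert pruning driven by token routing
# statistics. Every name and threshold here is an assumption for
# illustration, not the LAEP algorithm as published.
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 4
EXPERTS_PER_LAYER = 8
NUM_DEVICES = 2

# Assumed input: per-layer counts of tokens routed to each expert,
# accumulated over a window of training steps.
token_counts = rng.poisson(
    lam=1000, size=(NUM_LAYERS, EXPERTS_PER_LAYER)
).astype(float)
token_counts[:, -2:] *= 0.05  # make two experts clearly underutilized


def layer_adaptive_prune(counts, keep_ratio_floor=0.5, rel_threshold=0.2):
    """Return a boolean keep-mask per layer.

    An expert is pruned when its token share falls below `rel_threshold`
    times the layer's mean share; at least `keep_ratio_floor` of each
    layer's experts are always kept. Both knobs are assumptions.
    """
    num_experts = counts.shape[1]
    shares = counts / counts.sum(axis=1, keepdims=True)
    keep = shares >= rel_threshold * shares.mean(axis=1, keepdims=True)
    min_keep = int(np.ceil(keep_ratio_floor * num_experts))
    for layer in range(counts.shape[0]):
        if keep[layer].sum() < min_keep:
            # Fall back to keeping the top experts by token share.
            top = np.argsort(shares[layer])[::-1][:min_keep]
            keep[layer] = False
            keep[layer, top] = True
    return keep


def rebalance(counts, keep, num_devices):
    """Greedily place surviving experts so expected token load per
    device is roughly even (longest-processing-time heuristic)."""
    placement = []
    for layer in range(counts.shape[0]):
        loads = np.zeros(num_devices)
        assign = {}
        experts = np.flatnonzero(keep[layer])
        # Heaviest experts first, each to the currently lightest device.
        for e in experts[np.argsort(counts[layer, experts])[::-1]]:
            d = int(loads.argmin())
            assign[int(e)] = d
            loads[d] += counts[layer, e]
        placement.append(assign)
    return placement


keep = layer_adaptive_prune(token_counts)
placement = rebalance(token_counts, keep, NUM_DEVICES)
for layer, assign in enumerate(placement):
    print(f"layer {layer}: kept experts {sorted(assign)} -> devices {assign}")
```

Running the sketch prunes the two low-traffic experts in every layer and spreads the remaining six across the two devices by load; a layer-adaptive scheme differs from a global one in that the threshold is computed per layer, so layers with flatter routing distributions keep more experts.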

YuanLab.ai: Shawn Wu, Jiangang Luo, Tong Yu, Darcy Chen, Sean Wang, Xudong Zhao, Louie Li, Claire Wang, Hunter He, Carol Wang, Allen Wang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Understanding | MMLU | Accuracy | 78 | 756 |
| Math | GSM8K | Accuracy | 0.861 | 87 |
| Code | HumanEval | Accuracy | 70.7 | 50 |
| Mathematics | MATH | Accuracy | 66.1 | 32 |
| Coding | MBPP | Accuracy | 75.9 | 31 |
| Natural Language Understanding | ARC Challenge | Accuracy | 94.3 | 14 |
| Training Efficiency | Yuan3.0-1T Pre-training Base (train) | TFLOPS | 92.6 | 6 |
| Language | Pile (test) | Accuracy | 59.4 | 3 |
| Language | NaturalQuestions | Accuracy | 0.433 | 3 |
