Pruning and Distilling Mixture-of-Experts into Dense Language Models

About

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by 6.3 pp in average downstream accuracy after ~4B-token distillation, while training 1.6x faster in wall-clock time.
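The select-and-concatenate step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes each expert is a SwiGLU feed-forward block (gate, up, and down projections), uses placeholder scores in place of the paper's diversity-aware scoring, and omits the grouping, magnitude scaling, and distillation stages. All function names and dimensions are hypothetical. The sketch shows one fact that holds under these assumptions: concatenating the kept experts along the hidden dimension gives a dense FFN whose output equals the unweighted sum of those experts' outputs.

```python
import numpy as np


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def expert_forward(x, w_gate, w_up, w_down):
    """SwiGLU FFN (assumed expert form): (silu(x W_gate) * (x W_up)) W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def select_experts(scores, k):
    """Return the indices of the k highest-scoring experts, sorted ascending."""
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)


def concat_experts(experts, keep):
    """Stack the kept experts along the hidden dimension into one dense FFN."""
    w_gate = np.concatenate([experts[i][0] for i in keep], axis=1)
    w_up = np.concatenate([experts[i][1] for i in keep], axis=1)
    w_down = np.concatenate([experts[i][2] for i in keep], axis=0)
    return w_gate, w_up, w_down


# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_expert, n_experts, k = 8, 4, 6, 3

# Each expert is a (W_gate, W_up, W_down) triple with random weights.
experts = [
    (
        rng.standard_normal((d_model, d_expert)),
        rng.standard_normal((d_model, d_expert)),
        rng.standard_normal((d_expert, d_model)),
    )
    for _ in range(n_experts)
]

# Placeholder scores (an assumption); the paper's scoring methods are not reproduced here.
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8])

keep = select_experts(scores, k)
dense = concat_experts(experts, keep)

x = rng.standard_normal((5, d_model))
dense_out = expert_forward(x, *dense)
sum_out = sum(expert_forward(x, *experts[i]) for i in keep)

# The dense FFN reproduces the unweighted sum of the kept experts.
print(keep, dense[0].shape, np.allclose(dense_out, sum_out))
```

Because the concatenated FFN sums expert outputs without router weights, its output scale differs from the MoE layer's routed output. That mismatch is presumably what the magnitude scaling and distillation stages address, though the abstract does not give the details.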

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Commonsense Reasoning | HellaSwag | HellaSwag Accuracy | 32.1 | 711 |
| Question Answering | ARC Challenge | Accuracy (ARC) | 28.2 | 598 |
| Commonsense Reasoning | WinoGrande | Accuracy | 53 | 453 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy | 28.7 | 442 |
| Question Answering | ARC Easy | Accuracy | 53.7 | 210 |
| Multitask Knowledge | MMLU | Accuracy | 23.7 | 92 |
| Science Question Answering | ARC Easy | Accuracy | 36.7 | 75 |
| Language Understanding | Llama-3.1-70B Evaluation Suite MMLU, WinoGrande, HellaSwag, ARC-Easy, ARC-Challenge | MMLU | 46.1 | 13 |
| General Language Modeling Evaluation | Aggregate Wino Hella ARC-E ARC-C MMLU | Average Accuracy | 33.71 | 11 |
| General Language Understanding | Winogrande, HellaSwag, ARC, MMLU Consolidated | Average Accuracy | 42.39 | 11 |
Showing 10 of 11 rows
