Pruning and Distilling Mixture-of-Experts into Dense Language Models

About

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, which makes it a poor fit for memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation, while training 1.6x faster in wall-clock time.
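The "concatenated into a dense FFN" step can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a plain two-layer ReLU FFN per expert (the models named above use gated FFNs), it ignores router weights, scoring, grouping, and magnitude scaling, and every name in it (`concat_experts`, `ffn`, `random_matrix`) is hypothetical. Under those assumptions, stacking the selected experts along the hidden dimension gives one dense FFN whose output equals the unweighted sum of the individual expert outputs.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(W_up, W_down, x):
    """Toy FFN (assumption: plain ReLU, no gating): W_down @ relu(W_up @ x)."""
    hidden = [max(0.0, v) for v in matvec(W_up, x)]
    return matvec(W_down, hidden)

def concat_experts(experts):
    """Stack experts along the hidden dimension.

    Each expert is (W_up, W_down) with W_up: d_hidden x d_model and
    W_down: d_model x d_hidden. The result has hidden size k * d_hidden.
    """
    d_model = len(experts[0][1])
    # Up-projection: append the rows of every expert.
    W_up = [row for up, _ in experts for row in up]
    # Down-projection: append the columns of every expert, row by row.
    W_down = [[w for _, down in experts for w in down[i]] for i in range(d_model)]
    return W_up, W_down

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes; real experts are far larger.
rng = random.Random(0)
d_model, d_hidden, num_selected = 4, 3, 2
experts = [
    (random_matrix(d_hidden, d_model, rng), random_matrix(d_model, d_hidden, rng))
    for _ in range(num_selected)
]
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

W_up, W_down = concat_experts(experts)
dense_out = ffn(W_up, W_down, x)

# Reference: run each expert separately and add the outputs.
summed = [sum(vals) for vals in zip(*(ffn(up, down, x) for up, down in experts))]
```

In this sketch `dense_out` and `summed` agree up to floating-point rounding. A real MoE layer weights each expert's output by a router gate that the dense FFN no longer has, so the sketch covers only the structural part of the conversion, not the scaling or distillation steps the abstract describes.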

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho • 2026

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	HellaSwag Accuracy 32.1	897
Question Answering	ARC Challenge	Accuracy (ARC) 28.2	631
Multi-task Language Understanding	MMLU	MMLU Accuracy 28.7	456
Commonsense Reasoning	WinoGrande	Accuracy 53	453
Question Answering	ARC Easy	Accuracy 53.7	246
Science Question Answering	ARC Easy	Accuracy 36.7	108
Multitask Knowledge	MMLU	Accuracy 23.7	92
Language Understanding	Llama-3.1-70B Evaluation Suite MMLU, WinoGrande, HellaSwag, ARC-Easy, ARC-Challenge	MMLU 46.1	13
General Language Modeling Evaluation	Aggregate Wino Hella ARC-E ARC-C MMLU	Average Accuracy 33.71	11
General Language Understanding	Winogrande, HellaSwag, ARC, MMLU Consolidated	Average Accuracy 42.39	11
