
REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

About

Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we find that expert pruning is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
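The abstract says only that the criterion combines router gate-values with expert activation norms. One plausible reading, sketched below, scores each expert by the average of gate value times output L2 norm over the tokens routed to it, then drops the lowest-scoring experts. This is an assumption made for illustration: the function names, array shapes, and the handling of never-active experts are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def reap_style_scores(gates, out_norms, active):
    """Score experts by the mean of gate * ||expert output|| over routed tokens.

    Assumed reading of the criterion, not the authors' code.
    gates:     (tokens, experts) router gate values
    out_norms: (tokens, experts) L2 norm of each expert's output per token
    active:    (tokens, experts) boolean mask, True where the expert was selected
    """
    weighted = np.where(active, gates * out_norms, 0.0)
    counts = active.sum(axis=0)
    # Experts that never fire keep a score of 0, so they are pruned first.
    return np.divide(
        weighted.sum(axis=0),
        counts,
        out=np.zeros(gates.shape[1]),
        where=counts > 0,
    )


def experts_to_keep(scores, compression=0.5):
    """Return the sorted indices of the experts that survive pruning."""
    n_keep = len(scores) - int(round(len(scores) * compression))
    return np.sort(np.argsort(scores)[::-1][:n_keep])
```

In this sketch, an expert with a large gate value but a near-zero output norm scores low, as does an expert with a large output that the router barely weights, which is the intuition behind using both signals rather than routing frequency alone.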

Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | C4 | Perplexity 12.78 | 1565 |
| Mathematical Reasoning | GSM8K | Accuracy 89.6 | 1398 |
| Commonsense Reasoning | HellaSwag | HellaSwag Accuracy 31.9 | 711 |
| Question Answering | ARC Challenge | Accuracy (ARC) 27.5 | 598 |
| Commonsense Reasoning | WinoGrande | Accuracy 53.1 | 453 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy 28.1 | 442 |
| Multiple-choice Question Answering | ARC Easy | Accuracy 79.5 | 257 |
| Math | GSM8K | Accuracy 0.873 | 216 |
| Question Answering | ARC Easy | Accuracy 52.9 | 210 |
| Coding | HumanEval+ | -- | 164 |
Showing 10 of 41 rows
