Does a Global Perspective Help Prune Sparse MoEs Elegantly?

About

Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.
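The core idea, a single pruning budget allocated across layers according to redundancy rather than a fixed per-layer quota, can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function names (redundancy_scores, global_prune_plan) are hypothetical, and mean cosine similarity between expert weights is used here only as a placeholder redundancy score.

```python
import numpy as np

def redundancy_scores(expert_weights):
    """Score each expert by its mean cosine similarity to the other
    experts in the same layer (a placeholder redundancy proxy; the
    paper's actual criterion may differ)."""
    flat = [w.reshape(-1) for w in expert_weights]
    flat = [v / (np.linalg.norm(v) + 1e-8) for v in flat]
    sims = np.array([[float(a @ b) for b in flat] for a in flat])
    n = len(flat)
    # Average similarity to the other experts, excluding self-similarity.
    return (sims.sum(axis=1) - 1.0) / max(n - 1, 1)

def global_prune_plan(model_experts, total_to_prune):
    """Choose which (layer, expert) pairs to prune under one global budget,
    instead of removing a fixed number of experts from every layer.

    model_experts: list over layers; each entry is a list of expert weight arrays.
    total_to_prune: total number of experts to remove across all layers.
    """
    candidates = []
    for layer_idx, experts in enumerate(model_experts):
        for expert_idx, score in enumerate(redundancy_scores(experts)):
            candidates.append((score, layer_idx, expert_idx))
    # Most redundant experts first, regardless of which layer they sit in.
    candidates.sort(reverse=True)
    remaining = {i: len(e) for i, e in enumerate(model_experts)}
    plan = []
    for score, layer_idx, expert_idx in candidates:
        if len(plan) == total_to_prune:
            break
        if remaining[layer_idx] > 1:  # never empty a layer completely
            plan.append((layer_idx, expert_idx))
            remaining[layer_idx] -= 1
    return plan

# Toy usage: 4 layers of 8 experts with random weights, pruning 10 experts total.
rng = np.random.default_rng(0)
toy = [[rng.standard_normal((64, 64)) for _ in range(8)] for _ in range(4)]
print(global_prune_plan(toy, total_to_prune=10))
```

Because the candidates are ranked jointly across all layers, layers with more redundant experts naturally absorb a larger share of the pruning budget, which is the contrast with uniform per-layer allocation that the abstract describes.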

Zeliang Zhang, Nikhil Ghosh, Jiani Liu, Bin Yu, Xiaodong Liu • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Natural Language Inference | RTE | Accuracy | 92.6 | 448
Multi-task Language Understanding | MMLU | Accuracy | 85.3 | 321
Question Answering | OpenBookQA | Accuracy | 35.2 | 119
Recognizing Textual Entailment | RTE | Accuracy | 71.4 | 47
General Language Evaluation | Aggregated MMLU, BoolQ, OpenBookQA, RTE | Average Accuracy | 68.2 | 42
Multiple-choice Question Answering | MMLU | STEM Accuracy | 62.7 | 33
Boolean Question Answering | BoolQ | Accuracy | 89 | 29
Boolean Question Answering | BoolQ | Accuracy | 88 | 20
