Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
About
By increasing the number of model parameters while activating only a sparse subset for each task, the Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) without raising inference cost. However, the memory consumption due to the growing number of experts presents a challenge to deploying these models in many real-world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures: Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.
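The idea of grouping redundant experts and keeping one representative per group can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy cosine-similarity grouping, the threshold value, and the function names below are all illustrative assumptions, with each expert represented as a flattened weight vector.

```python
import math

def cosine(u, v):
    # Cosine similarity between two flattened expert weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def group_experts(experts, threshold=0.95):
    # Greedy grouping (an assumption, not the paper's algorithm):
    # each expert joins the first group whose representative
    # (the group's first member) it is sufficiently similar to.
    groups = []  # list of lists of expert indices
    for i, w in enumerate(experts):
        for g in groups:
            if cosine(experts[g[0]], w) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

def prune_experts(experts, groups):
    # Keep one representative per group; near-duplicates are dropped.
    return [experts[g[0]] for g in groups]

# Toy example: experts 0 and 1 are nearly identical, expert 2 is distinct.
experts = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
groups = group_experts(experts)        # → [[0, 1], [2]]
pruned = prune_experts(experts, groups)  # two experts remain
```

A real system would instead compare experts by their feed-forward weight matrices (or their activations on calibration data) and could merge group members rather than simply discarding them.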
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | OpenBookQA | Accuracy | 35.8 | 465 |
| Natural Language Inference | RTE | Accuracy | 71.1 | 367 |
| Question Answering | BoolQ | -- | -- | 240 |
| Reading Comprehension | BoolQ | Accuracy | 88 | 219 |
| Language Understanding | MMLU | Humanities Avg | 63.7 | 33 |
| General Language Evaluation | Aggregated MMLU, BoolQ, OpenBookQA, RTE | Average Accuracy | 67.6 | 22 |