
Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

About

The Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) by increasing model parameters while activating them only sparsely per task, so inference cost does not grow. However, the memory consumption of the growing number of experts hinders the deployment of these models in many real-world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method that groups similar experts and prunes them to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures: Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.

Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, Jianfeng Gao • 2024
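The grouping-and-pruning idea described in the abstract can be illustrated with a minimal sketch: flatten each expert's weights, measure pairwise cosine similarity, greedily group experts whose weights are nearly parallel, and keep one representative per group. The threshold, the greedy grouping, and the function name are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def prune_similar_experts(expert_weights, sim_threshold=0.9):
    """Greedily group experts whose flattened weights have cosine
    similarity >= sim_threshold; keep one representative per group.
    Illustrative sketch only, not the paper's exact method."""
    # Normalize each expert's flattened weight matrix to unit length.
    flat = [w.flatten() for w in expert_weights]
    flat = [v / (np.linalg.norm(v) + 1e-12) for v in flat]
    kept = []        # indices of representative (surviving) experts
    assignment = {}  # expert index -> index of its representative
    for i, v in enumerate(flat):
        for j in kept:
            if float(flat[j] @ v) >= sim_threshold:
                assignment[i] = j  # redundant: fold expert i into group j
                break
        else:
            kept.append(i)         # novel enough: expert i survives
            assignment[i] = i
    return kept, assignment
```

In an actual MoE model one would also redirect the router's logits for pruned experts to their group representatives; the sketch only identifies the redundant experts.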

Related benchmarks

Task | Dataset | Result | Rank
Question Answering | OpenBookQA | Accuracy 35.8 | 465
Natural Language Inference | RTE | Accuracy 92.4 | 448
Multi-task Language Understanding | MMLU | Accuracy 83.6 | 321
Question Answering | BoolQ | -- | 317
Reading Comprehension | BoolQ | Accuracy 88 | 279
Question Answering | OpenBookQA | Accuracy 35.8 | 119
Recognizing Textual Entailment | RTE | Accuracy 71.1 | 47
General Language Evaluation | Aggregated MMLU, BoolQ, OpenBookQA, RTE | Average Accuracy 67.6 | 42
Language Understanding | MMLU | Humanities Avg 63.7 | 33
Multiple-choice Question Answering | MMLU | STEM Accuracy 60.2 | 33

Showing 10 of 12 rows
