DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning

About

Despite the significant breakthroughs of Mixture-of-Experts (MoE) models, their increasing scale presents substantial memory and storage challenges. Existing MoE pruning methods, which reduce parameter count with a uniform sparsity across all layers, often lead to suboptimal outcomes and performance degradation because expert redundancy varies across MoE layers. To address this, we propose a non-uniform pruning strategy, dubbed Differentiable Expert Pruning (DiEP), which adaptively adjusts pruning rates at the layer level while jointly learning inter-layer importance, effectively capturing the varying redundancy across different MoE layers. By transforming the global discrete search space into a continuous one, our method handles the exponentially growing number of non-uniform expert combinations, enabling adaptive gradient-based pruning. Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, DiEP retains around 92% of the original performance on Mixtral 8×7B with only half the experts, outperforming other pruning methods by up to 7.1% on the challenging MMLU dataset.

Sikai Bai, Haoxi Li, Jie Zhang, Zicong Hong, Song Guo • 2025
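The abstract describes relaxing the discrete choice of which experts to keep into a continuous, gradient-optimizable one, with per-layer pruning rates and a jointly learned inter-layer importance. The page does not include code, so below is a minimal PyTorch sketch of that general idea only: it assumes a sigmoid relaxation of per-expert keep/drop decisions, a learnable layer-importance scalar, and a simple sparsity budget penalty. Class names, the temperature, and the loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): differentiable expert
# pruning via learnable per-expert logits relaxed with a sigmoid, plus a
# learnable per-layer importance weight. Names and hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn


class LayerExpertGate(nn.Module):
    """Learnable soft keep/drop mask over the experts of one MoE layer."""

    def __init__(self, num_experts: int, temperature: float = 1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_experts))  # per-expert keep logits
        self.layer_importance = nn.Parameter(torch.zeros(1))  # inter-layer weight
        self.temperature = temperature

    def forward(self) -> torch.Tensor:
        # Continuous relaxation of the discrete keep/drop decision.
        keep_prob = torch.sigmoid(self.logits / self.temperature)
        # Scale by a learned layer-level importance so pruning rates can
        # differ across layers (non-uniform sparsity).
        return keep_prob * torch.sigmoid(self.layer_importance)


def sparsity_loss(gates, target_keep_ratio: float) -> torch.Tensor:
    # Penalize deviation of the expected global keep ratio from the budget,
    # letting gradients decide which layers give up more experts.
    kept = torch.cat([g().flatten() for g in gates])
    return (kept.mean() - target_keep_ratio) ** 2


# Toy usage: 4 MoE layers with 8 experts each, keeping roughly half overall.
gates = [LayerExpertGate(num_experts=8) for _ in range(4)]
opt = torch.optim.Adam([p for g in gates for p in g.parameters()], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    # A real setup would combine a task loss on calibration data with the
    # sparsity term; only the sparsity term is shown here.
    loss = sparsity_loss(gates, target_keep_ratio=0.5)
    loss.backward()
    opt.step()

# After optimization, threshold the soft masks to obtain a discrete,
# layer-wise non-uniform pruning pattern.
masks = [(g() > 0.5) for g in gates]
```

In a full pipeline, the thresholded masks would determine which experts are removed from each MoE layer before fine-tuning or evaluation; the key point illustrated is that the keep ratio is not fixed per layer but emerges from the optimization.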

Related benchmarks

Task | Dataset | Result | Rank
Video Understanding | MVBench | Accuracy 66.82 | 425
Video Understanding | VideoMME | -- | 222
Video Understanding | EgoSchema | -- | 158
Chart Understanding | ChartQA | Accuracy 82.79 | 127
Visual Question Answering | TextVQA | Accuracy 87.43 | 94
Video Understanding | LVB | Accuracy 57.84 | 89
Image Understanding | MMStar | Score 61.82 | 54
Real-world Visual Understanding | RealworldQA | Accuracy 62.56 | 47
Image Understanding | TextVQA | Accuracy 83.68 | 40
Multi-modal Understanding | MMVet | Accuracy 68.13 | 40
Showing 10 of 17 rows
