DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning
About
Despite the significant breakthrough of Mixture-of-Experts (MoE), the increasing scale of these MoE models presents huge memory and storage challenges. Existing MoE pruning methods, which involve reducing parameter size with a uniform sparsity across all layers, often lead to suboptimal outcomes and performance degradation due to varying expert redundancy in different MoE layers. To address this, we propose a non-uniform pruning strategy, dubbed Differentiable Expert Pruning (DiEP), which adaptively adjusts pruning rates at the layer level while jointly learning inter-layer importance, effectively capturing the varying redundancy across different MoE layers. By transforming the global discrete search space into a continuous one, our method handles exponentially growing non-uniform expert combinations, enabling adaptive gradient-based pruning. Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, DiEP retains around 92% of original performance on Mixtral 8×7B with only half the experts, outperforming other pruning methods by up to 7.1% on the challenging MMLU dataset.
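To make the idea of non-uniform, layer-adaptive pruning concrete, the following is a minimal sketch and not the authors' implementation. It assumes that some gradient-based procedure has already produced per-expert logits (`alpha`) and per-layer logits (`beta`); both names, the softmax weighting, and the `min_per_layer` safeguard are hypothetical choices made for illustration. The sketch covers only the final discretization step: experts are ranked globally by a combined score, so different layers can end up keeping different numbers of experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def non_uniform_prune(alpha, beta, keep_total, min_per_layer=1):
    """Select experts globally instead of with a fixed per-layer ratio.

    alpha[l][e]: hypothetical learned logit of expert e in layer l.
    beta[l]:     hypothetical learned logit of layer l.
    Returns, for each layer, the sorted indices of the experts to keep.
    """
    num_layers = len(alpha)
    if keep_total < num_layers * min_per_layer:
        raise ValueError("keep_total is too small for min_per_layer")

    layer_weight = softmax(beta)

    # Combined score: layer weight times within-layer expert weight.
    scored = []
    for l, logits in enumerate(alpha):
        for e, p in enumerate(softmax(logits)):
            scored.append((layer_weight[l] * p, l, e))

    # Assumption: every layer keeps its best experts so no layer is emptied.
    kept = []
    for logits in alpha:
        order = sorted(range(len(logits)), key=lambda e: -logits[e])
        kept.append(order[:min_per_layer])
    chosen = {(l, e) for l in range(num_layers) for e in kept[l]}

    # Fill the remaining budget with the globally highest-scoring experts.
    remaining = keep_total - len(chosen)
    for _, l, e in sorted(scored, reverse=True):
        if remaining == 0:
            break
        if (l, e) not in chosen:
            kept[l].append(e)
            chosen.add((l, e))
            remaining -= 1

    return [sorted(k) for k in kept]


# Toy example: two layers with four experts each, keep half of all experts.
alpha = [[2.0, 1.5, 1.0, -1.0], [0.5, 0.0, -0.5, -1.0]]
beta = [1.0, -1.0]
kept = non_uniform_prune(alpha, beta, keep_total=4)
print(kept)  # [[0, 1, 2], [0]]
```

In this toy input the first layer carries a higher layer logit, so it keeps three experts while the second keeps one. A uniform 50% scheme would have kept two in each.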
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 66.82 | 247 |
| Video Understanding | VideoMME | -- | -- | 192 |
| Chart Understanding | ChartQA | Accuracy | 82.79 | 83 |
| Visual Question Answering | TextVQA | Accuracy | 87.43 | 69 |
| Video Understanding | EgoSchema | Accuracy | 58.74 | 49 |
| Image Understanding | MME | Score | 2.08e+3 | 39 |
| Multi-modal Understanding | MMVet | Accuracy | 68.13 | 35 |
| Image Understanding | Image Understanding Suite (TextVQA, ChartQA, MMStar, MMBench, MMVet, MME, RealWorldQA, COCO) | TextVQA Score | 82.04 | 34 |
| Video Understanding | Video Understanding Suite (MVBench, EgoSchema, VMME, LVB, VMMMU) | MVBench Score | 63.15 | 34 |
| Real-world Visual Understanding | RealworldQA | Accuracy | 62.56 | 24 |