# EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

## About
Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains constrained by memory footprint and throughput because the full expert pool must still be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the layer-wise allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce the **E**xpected **S**peculative **A**cceptance **P**roxy (**ESAP**), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model without costly autoregressive decoding. ESAP is bounded and stable, enabling cheap comparison of many candidates. Building on ESAP, we propose EvoESAP, an evolutionary search framework that finds an improved non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it a plug-and-play method for criteria such as Frequency, EAN, SEER, and REAP. Across 7B–30B SMoE LLMs at 25% and 50% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to **+19.6%** on MATH-500 at 50% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity. Code is available at https://github.com/ZongfangLiu/EvoESAP.
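To illustrate the search described above, here is a minimal, self-contained sketch of evolutionary search over a non-uniform layer-wise allocation under a fixed global budget. The layer counts, budget, hill-climbing loop, and the toy fitness function (`toy_esap`, a stand-in for the real ESAP proxy) are all assumptions for illustration, not the repository's implementation.

```python
import random

NUM_LAYERS = 4
EXPERTS_PER_LAYER = 8
GLOBAL_BUDGET = 16  # e.g. 50% sparsity: keep 16 of 32 experts overall

def mutate(alloc):
    """Move one kept expert from one layer to another, preserving the global budget."""
    alloc = list(alloc)
    src = random.choice([i for i in range(NUM_LAYERS) if alloc[i] > 1])
    dst = random.choice([i for i in range(NUM_LAYERS)
                         if alloc[i] < EXPERTS_PER_LAYER and i != src])
    alloc[src] -= 1
    alloc[dst] += 1
    return alloc

def toy_esap(alloc):
    """Placeholder fitness (NOT the real ESAP): a made-up score that
    happens to prefer keeping more experts in later layers."""
    return sum((i + 1) * k for i, k in enumerate(alloc))

def evolve(generations=300, seed=0):
    """Hill-climb from the uniform allocation toward a better non-uniform one."""
    random.seed(seed)
    best = [GLOBAL_BUDGET // NUM_LAYERS] * NUM_LAYERS  # uniform baseline
    best_score = toy_esap(best)
    for _ in range(generations):
        cand = mutate(best)           # budget-preserving perturbation
        score = toy_esap(cand)        # cheap proxy evaluation
        if score > best_score:        # keep the better allocation
            best, best_score = cand, score
    return best, best_score
```

Because every mutation moves one expert between layers, the total number of kept experts never changes, so any allocation the search returns satisfies the global budget by construction; the within-layer pruning order is untouched, which is what makes the scheme plug-and-play with existing ranking criteria.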
## Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Multiple-Choice QA | Multiple-Choice Suite | MC Avg | 67.5 | 49 |
| Multiple-choice Question Answering | MC (test) | MC Avg | 70.6 | 46 |
| Mathematical Reasoning | GSM8K, MATH-500 (test) | Accuracy (GSM8K) | 91 | 43 |
| Code Generation | Eval+, LiveCode (test) | Eval+ Score | 87.2 | 32 |
| Coding | Coding Suite (EvalPlus & LiveCodeBench) | Eval+ Score | 83.8 | 26 |
| Open-ended Generation | WildBench | WildBench Score | 0.377 | 26 |
| Open-ended Generation | WildBench (test) | WildBench Score | 56.5 | 17 |