
SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

About

The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoE) models, known for dynamically allocating computational resources based on the input. Despite their promise, MoE models face challenges, particularly in terms of memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage prunes the total number of experts using heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy loss and reduce the number of activated experts during inference. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.

Alexandre Muzio, Alex Sun, Churan He • 2024
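As a rough illustration of the first stage described in the abstract, the sketch below counts how often each expert is selected by top-k routing over a calibration set and keeps only the most frequently activated ("heavy hitter") experts. This is a minimal sketch under assumed tensor shapes and a hypothetical keep_ratio, not the authors' implementation, and it omits the second-stage regularization-based fine-tuning entirely.

```python
import torch

def heavy_hitter_counts(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts); returns per-expert activation counts."""
    top_experts = router_logits.topk(top_k, dim=-1).indices        # (num_tokens, top_k)
    return torch.bincount(top_experts.flatten(),
                          minlength=router_logits.shape[-1])

def experts_to_keep(counts: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Indices of the most frequently activated ("heavy hitter") experts."""
    num_keep = max(1, int(round(keep_ratio * counts.numel())))
    ranked = torch.argsort(counts, descending=True)                # heaviest hitters first
    return ranked[:num_keep].sort().values

# Toy usage: 8 experts, top-2 routing, 1000 calibration tokens (illustrative numbers).
torch.manual_seed(0)
logits = torch.randn(1000, 8)
counts = heavy_hitter_counts(logits, top_k=2)
kept = experts_to_keep(counts, keep_ratio=0.5)
print("per-expert hit counts:", counts.tolist())
print("experts kept         :", kept.tolist())
```

In practice the counts would come from routing statistics gathered on real calibration data rather than random logits, and the pruned model would then go through the paper's fine-tuning stage to recover accuracy.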

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | GSM8K | Accuracy | 74.8 | 1362 |
| Language Understanding | MMLU | Accuracy | 49.64 | 825 |
| Multiple-Choice QA | Multiple-Choice Suite | MC Avg | 0.658 | 49 |
| Multiple-choice Question Answering | MC (test) | MC Avg | 73.4 | 46 |
| Creative Writing | WildBench | WildBench Score | 38.1 | 45 |
| Code Generation | Coding Eval+ LiveCode (test) | Eval+ Score | 0.3 | 32 |
| Code Generation | LiveCodeBench | Accuracy | 58.82 | 30 |
| Math Reasoning | AIME 2025 | Accuracy | 69.17 | 27 |
| Math Reasoning | CNMO 2024 | Accuracy | 73.61 | 27 |
| Science Question Answering | GPQA | Accuracy | 62.63 | 27 |
Showing 10 of 16 rows
