
SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

About

The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoE) models, known for their dynamic allocation of computational resources based on input. Despite their promise, MoEs face challenges, particularly in terms of memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage prunes the total number of experts, guided by heavy-hitters counting, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy lost to pruning and to reduce the number of experts activated during inference. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.

Alexandre Muzio, Alex Sun, Churan He • 2024
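
The first stage described in the abstract, pruning experts under heavy-hitters counting guidance, can be illustrated with a minimal sketch. It assumes top-k routing and a hard count of how often each expert is selected on a calibration batch; the function names (`count_expert_hits`, `select_experts_to_keep`) and the toy router scores are hypothetical, and the paper's actual counting rule may differ.

```python
from collections import Counter


def count_expert_hits(router_logits, top_k):
    """Count how often each expert appears in a token's top-k routing choices."""
    hits = Counter()
    for logits in router_logits:  # one list of per-expert scores per token
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        hits.update(ranked[:top_k])
    return hits


def select_experts_to_keep(router_logits, top_k, num_keep):
    """Return the indices of the num_keep most frequently selected experts."""
    num_experts = len(router_logits[0])
    hits = count_expert_hits(router_logits, top_k)
    # Experts that were never selected have a count of 0 in the Counter.
    ranked = sorted(range(num_experts), key=lambda e: hits[e], reverse=True)
    return sorted(ranked[:num_keep])


# Toy calibration batch: 4 tokens, 4 experts (made-up numbers, not from the paper).
toy_logits = [
    [2.0, 0.1, 1.5, -1.0],
    [1.8, 0.3, 0.2, -0.5],
    [0.1, 0.0, 2.2, 1.9],
    [2.5, -0.2, 1.1, 0.4],
]
kept = select_experts_to_keep(toy_logits, top_k=2, num_keep=2)
print(kept)  # experts 0 and 2 are each selected three times
```

In this toy batch, experts 1 and 3 are each selected only once, so they would be the candidates for removal. The second stage (regularization-based fine-tuning) is not sketched here because the abstract does not specify the regularizer.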

Related benchmarks

Task                                Dataset                       Result                 Rank
Mathematical Reasoning              GSM8K                         Accuracy 74.8          1398
Language Understanding              MMLU                          Accuracy 49.64         844
Code Generation                     LiveCodeBench                 Accuracy 58.82         84
Creative Writing                    WildBench                     WildBench Score 38.1   49
Multiple-Choice QA                  Multiple-Choice Suite         MC Avg 0.658           49
Multiple-choice Question Answering  MC (test)                     MC Avg 73.4            46
Math Reasoning                      MATH 500                      Accuracy 96            38
Code Generation                     Coding Eval+ LiveCode (test)  Eval+ Score 0.3        32
Math Reasoning                      AIME 2025                     Accuracy 69.17         27
Math Reasoning                      CNMO 2024                     Accuracy 73.61         27
Showing 10 of 16 rows
