SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

About

The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoE) models, known for dynamically allocating computational resources based on the input. Despite their promise, MoE models face challenges, particularly their high memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and the compute requirements of pre-trained MoE models. The first stage prunes the total number of experts, guided by heavy-hitters counting, while the second stage employs a regularization-based fine-tuning strategy to recover the lost accuracy and reduce the number of experts activated during inference. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
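The first stage described above can be illustrated with a minimal sketch. The abstract does not give the details of the heavy-hitters counting, so everything here is an assumption: the function names (`count_expert_activations`, `select_experts_to_keep`), the use of top-k router selections on a calibration batch as the counting signal, and the rule of keeping the most frequently selected experts are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def count_expert_activations(router_logits: np.ndarray, top_k: int) -> np.ndarray:
    """Count how often each expert lands in a token's top-k routing choices.

    router_logits has shape (num_tokens, num_experts). Assumption: the
    "heavy hitters" are the experts selected most often by the router.
    """
    num_experts = router_logits.shape[1]
    # Indices of the top_k largest logits per token (order inside the top-k is irrelevant).
    top_idx = np.argpartition(router_logits, -top_k, axis=1)[:, -top_k:]
    return np.bincount(top_idx.ravel(), minlength=num_experts)


def select_experts_to_keep(counts: np.ndarray, num_keep: int) -> np.ndarray:
    """Return the sorted indices of the num_keep most frequently selected experts."""
    # Stable sort on negated counts: descending by count, ties broken by lower index.
    order = np.argsort(-counts, kind="stable")
    return np.sort(order[:num_keep])


# Toy calibration batch: 3 tokens routed over 4 experts with top-2 routing.
logits = np.array([
    [5.0, 1.0, 0.0, 4.0],
    [4.0, 0.5, 0.2, 3.0],
    [0.1, 0.2, 6.0, 5.0],
])
counts = count_expert_activations(logits, top_k=2)
kept = select_experts_to_keep(counts, num_keep=2)
print(counts.tolist())  # [2, 0, 1, 3]
print(kept.tolist())    # [0, 3]
```

In this toy batch, expert 1 is never selected and expert 2 only once, so a budget of two experts keeps experts 0 and 3. A real pipeline would accumulate counts per MoE layer over a calibration dataset and would then need the fine-tuning stage to recover accuracy.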

Alexandre Muzio, Alex Sun, Churan He • 2024

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy: 74.8	1398
Language Understanding	MMLU	Accuracy: 49.64	844
Code Generation	LiveCodeBench	Accuracy: 58.82	84
Creative Writing	WildBench	WildBench Score: 38.1	49
Multiple-Choice QA	Multiple-Choice Suite	MC Avg: 0.658	49
Multiple-Choice QA	MC (test)	MC Avg: 73.4	46
Math Reasoning	MATH 500	Accuracy: 96	38
Code Generation	Coding Eval+ LiveCode (test)	Eval+ Score: 0.3	32
Math Reasoning	AIME 2025	Accuracy: 69.17	27
Math Reasoning	CNMO 2024	Accuracy: 73.61	27

Showing 10 of 16 rows
