Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

About

The Mixture of Experts (MoE) is a widely known neural architecture in which an ensemble of specialized sub-models optimizes overall performance at a constant computational cost. However, conventional MoEs pose challenges at scale due to the need to store all experts in memory. In this paper, we push MoE to the limit. We propose an extremely parameter-efficient MoE by uniquely combining the MoE architecture with lightweight experts. Our MoE architecture outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on par with full fine-tuning while updating only the lightweight experts -- less than 1% of an 11B-parameter model. Furthermore, our method generalizes to unseen tasks as it does not depend on any prior task knowledge. Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even under rigorous parameter constraints. The code used in all experiments is publicly available here: https://github.com/for-ai/parameter-efficient-moe.
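
To make the routing pattern the abstract describes concrete, here is a minimal PyTorch sketch: a frozen pretrained projection combined with a handful of LoRA-style lightweight experts through a dense softmax router, so that only the router and the low-rank expert matrices are trained. The class name `MixtureOfLoraExperts`, the shapes, and the hyperparameters are illustrative assumptions, not the authors' code; see the linked repository for their actual implementation.

```python
import torch
import torch.nn as nn


class MixtureOfLoraExperts(nn.Module):
    """Minimal sketch: a frozen pretrained linear layer augmented with a
    softly-routed mixture of low-rank (LoRA-style) experts. Only the
    router and the tiny expert matrices receive gradients."""

    def __init__(self, base_linear: nn.Linear, num_experts: int = 4, rank: int = 4):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features
        # Per-expert low-rank factors: A projects down, B projects back up.
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        # Dense (soft) router: each token gets a weight over all experts.
        self.router = nn.Linear(d_in, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in)
        gate = torch.softmax(self.router(x), dim=-1)  # (b, s, num_experts)
        # Low-rank update from each expert: x @ A_e @ B_e.
        expert_out = torch.einsum("bsd,edr,ero->bseo", x, self.A, self.B)
        # Blend the expert updates per token and add to the frozen output.
        delta = torch.einsum("bse,bseo->bso", gate, expert_out)
        return self.base(x) + delta


# Hypothetical usage: wrap one projection of a pretrained model and train
# only the router and expert parameters.
layer = MixtureOfLoraExperts(nn.Linear(512, 512), num_experts=4, rank=4)
out = layer(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```

Because each expert contributes only two rank-`r` matrices per wrapped layer, the trainable footprint stays a small fraction of the base model, which is what enables the under-1% figure quoted above.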

Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, Sara Hooker • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Pass@1 | 24.83 | 1036 |
| Multi-task Language Understanding | MMLU | Accuracy | 73.2 | 876 |
| Reasoning | BBH | Accuracy | 35.4 | 672 |
| Image Classification | EuroSAT | Accuracy | 98.63 | 569 |
| Code Generation | HumanEval (test) | Pass@1 | 43.78 | 506 |
| Image Classification | SUN397 | Accuracy | 52.55 | 441 |
| Classification | Cars | Accuracy | 50.83 | 395 |
| Image Classification | RESISC45 | Accuracy | 92.58 | 349 |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy | 78 | 291 |
| Commonsense Reasoning | Commonsense Reasoning (BoolQ, PIQA, SIQA, HellaS., WinoG., ARC-e, ARC-c, OBQA) (test) | BoolQ Accuracy | 73.15 | 202 |

Showing 10 of 30 rows.
