Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

About

The Sparse Mixture of Experts (SMoE) has been widely employed to enhance the efficiency of training and inference for Transformer-based foundational models, yielding promising results. However, the performance of SMoE depends heavily on the choice of hyper-parameters, such as the number of experts and the number of experts to be activated (referred to as top-k), so searching over hyper-parameter configurations requires extensive model training and incurs significant computational overhead. As a remedy, we introduce the Dynamic Mixture of Experts (DynMoE) technique. DynMoE incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate, and (2) an adaptive process that automatically adjusts the number of experts during training. Extensive numerical results across Vision, Language, and Vision-Language tasks demonstrate that our approach achieves competitive performance compared to GMoE for vision and language tasks, and to MoE-LLaVA for vision-language tasks, while maintaining efficiency by activating fewer parameters. Our code is available at https://github.com/LINs-lab/DynMoE.
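
The abstract names the two mechanisms but does not spell out their rules, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each expert has a key vector and a threshold, that a token activates every expert whose sigmoid-scaled cosine similarity exceeds that expert's threshold, and that a token matching no expert falls back to its single best-scoring expert. It further assumes that the adaptive step drops experts no token selected and requests a new expert when some token needed the fallback. The names threshold_gate and adapt_expert_count, along with all vectors and threshold values, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def threshold_gate(token, expert_keys, thresholds):
    """Pick experts for one token (assumed rule, not taken from the abstract).

    Returns (active_indices, fell_back). The number of active experts varies
    per token instead of being fixed by a top-k hyper-parameter.
    """
    scores = [sigmoid(cosine(token, key)) for key in expert_keys]
    active = [i for i, (s, t) in enumerate(zip(scores, thresholds)) if s > t]
    if active:
        return active, False
    # Assumed fallback: no expert passed its threshold, so use the best one.
    best = max(range(len(scores)), key=scores.__getitem__)
    return [best], True


def adapt_expert_count(routes, num_fallbacks, num_experts):
    """Assumed adaptation step: keep only experts that were used, and signal
    that a new expert should be added if any token needed the fallback."""
    used = {i for route in routes for i in route}
    keep = [i for i in range(num_experts) if i in used]
    return keep, num_fallbacks > 0


# Hypothetical toy data: three experts with 2-D keys and equal thresholds.
expert_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
thresholds = [0.6, 0.6, 0.6]
tokens = [[1.0, 0.0], [1.0, 1.0], [0.1, -1.0]]

results = [threshold_gate(tok, expert_keys, thresholds) for tok in tokens]
routes = [active for active, _ in results]
num_fallbacks = sum(1 for _, fell_back in results if fell_back)
keep, add_new = adapt_expert_count(routes, num_fallbacks, len(expert_keys))

for tok, route in zip(tokens, routes):
    print(tok, "->", route)
print("experts kept:", keep, "| add a new expert:", add_new)
```

In this toy run the three tokens activate one, two, and one expert respectively, which illustrates how the count can differ per token without a fixed top-k. A real model would learn the keys and thresholds jointly with the experts.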

Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Zhaopeng Tu, Tao Lin • 2024

Related benchmarks

Task	Dataset	Metric	Result
Science Question Answering	ScienceQA	Accuracy	81.41	916
Language Modeling	WikiText-103 (test)	Perplexity	34.29	773
Object Hallucination Evaluation	POPE	Accuracy	87.84	259
Question Answering	BoolQ	Accuracy	60.89	233
Commonsense Reasoning	CSQA	CSQA Accuracy	61.51	195
Vision-Language Understanding	MMBench	Accuracy	51.72	88
Language Understanding	CEval	Accuracy	37.6	67
Language Understanding	CMMLU	Accuracy	39.35	65
Object Detection	DAIR-V2X	AP@0.5	68.4	63
Domain Generalization	DomainBed (out-of-domain)	VLCS Accuracy	79.4	55

Showing 10 of 17 rows
