AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models

About

Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without significant overhead. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because different tokens (e.g., "<EOS>" vs. "apple") may require different numbers of experts for feature abstraction. Lifting this constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. To this end, we introduce AdaMoE to realize token-adaptive routing for MoE, where different tokens are permitted to select a varying number of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing -- it simply introduces a fixed number of null experts, which do not consume any FLOPs, to the expert set and increases the value of k. AdaMoE does not force each token to occupy a fixed number of null experts but instead controls the average usage of the null experts with a load-balancing loss, leading to an adaptive number of null/true experts used by each token. AdaMoE exhibits a strong resemblance to MoEs with expert choice routing while allowing for trivial auto-regressive modeling. AdaMoE is easy to implement and can be effectively applied to pre-trained (MoE-)LLMs. Extensive studies show that AdaMoE can reduce average expert load (FLOPs) while achieving superior performance. For example, on the ARC-C dataset, applying our method to fine-tuning Mixtral-8x7B can reduce FLOPs by 14.5% while increasing accuracy by 1.69%.

Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng • 2024
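
The routing idea in the abstract (add null experts to the expert set, raise k, and let each token's top-k selection decide how many true experts it actually uses) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `route_with_null_experts`, the parameter `num_true`, the softmax over all experts, and the renormalization of gate weights over the selected true experts are assumptions made for illustration, and the load-balancing loss is not included.

```python
import numpy as np


def route_with_null_experts(logits, num_true, k):
    """Top-k routing over true + null experts (illustrative sketch only).

    logits: (num_tokens, num_true + num_null) router scores. Columns
            [0, num_true) are true experts; the remaining columns are
            null experts, which produce no output and cost no FLOPs.
    Returns (weights, true_counts):
      weights     -- (num_tokens, num_true) gate weights, zero where a
                     true expert was not selected.
      true_counts -- number of true experts each token actually uses.
    """
    logits = np.asarray(logits, dtype=np.float64)

    # Numerically stable softmax over the full expert set (assumption).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Ordinary top-k selection; null experts compete with true experts.
    topk = np.argsort(-probs, axis=-1)[:, :k]
    selected = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(selected, topk, True, axis=-1)

    # Slots that landed on null experts are simply dropped.
    true_selected = selected[:, :num_true]
    weights = np.where(true_selected, probs[:, :num_true], 0.0)

    # Renormalize over the selected true experts (assumption); a token
    # that picked only null experts keeps all-zero weights.
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights, true_selected.sum(axis=-1)


# Two tokens, 4 true experts + 2 null experts, k = 3.
demo_logits = np.array([
    [5.0, 4.0, 0.0, 0.0, 1.0, 1.0],  # strong scores on two true experts
    [0.0, 0.0, 0.0, 3.0, 5.0, 5.0],  # strong scores on both null experts
])
demo_weights, demo_counts = route_with_null_experts(demo_logits, num_true=4, k=3)
print(demo_counts)  # [2 1]: the tokens use different numbers of true experts
```

With the same k for both tokens, the first token spends one of its three slots on a null expert and the second spends two, so the number of true experts evaluated varies per token.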

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval	--	1048
Science Question Answering	ScienceQA	Accuracy 84.93	916
Mathematical Reasoning	GSM8K	Accuracy 93.9	388
General Knowledge	MMLU	MMLU General Knowledge Accuracy 72.51	373
Object Hallucination Evaluation	POPE	Accuracy 88.27	259
Reading Comprehension	BoolQ	Accuracy (BoolQ) 84.71	258
Mathematical Reasoning	AIME 25	Accuracy 42.4	112
Instruction Following	IFEval	Accuracy (IFEval) 82.4	101
Vision-Language Understanding	MMBench	Accuracy 56.95	88
Code Generation	HumanEval+	Pass Rate 82.7	75

