
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

About

Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer severe performance drops at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5× faster decoding and 1.4× higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.
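
The abstract does not spell out how the masks are formulated, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each of a token's Top-K candidate experts has a trainable mask logit that is hard-thresholded to 0 or 1 in the forward pass, while the backward pass substitutes the sigmoid's gradient (one common straight-through-style surrogate). The function names, the 0.5 threshold, the mean-activation penalty, and the renormalization of the kept router weights are all hypothetical choices made for illustration; expert outputs are scalars to keep the example short.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_mask_forward(mask_logits, threshold=0.5):
    """Forward pass: hard 0/1 mask per candidate expert (threshold is an assumption)."""
    return [1.0 if sigmoid(z) > threshold else 0.0 for z in mask_logits]

def ste_grad(mask_logits, upstream_grads):
    """Backward pass surrogate: pretend the hard mask was sigmoid(z),
    so d(mask)/dz is approximated by sigmoid(z) * (1 - sigmoid(z))."""
    return [g * sigmoid(z) * (1.0 - sigmoid(z))
            for z, g in zip(mask_logits, upstream_grads)]

def sparsity_penalty(mask_logits):
    """Assumed auxiliary regularizer: mean soft activation, which pushes masks toward 0."""
    probs = [sigmoid(z) for z in mask_logits]
    return sum(probs) / len(probs)

def masked_moe_output(router_weights, expert_outputs, mask):
    """Combine only experts whose mask is 1, renormalizing the kept router weights
    (renormalization is an assumption of this sketch)."""
    kept = [w * m for w, m in zip(router_weights, mask)]
    total = sum(kept)
    if total == 0.0:
        return 0.0  # every candidate masked; a real system would need a fallback
    return sum((k / total) * y for k, y in zip(kept, expert_outputs))

# One token with Top-4 candidate experts (all numbers are made up).
logits = [2.0, -1.0, 0.3, -3.0]
mask = binary_mask_forward(logits)           # two of the four experts stay active
output = masked_moe_output([0.4, 0.3, 0.2, 0.1], [1.0, 2.0, 3.0, 4.0], mask)
active_fraction = sum(mask) / len(mask)      # share of expert computation actually run
```

In this toy setup the number of active experts varies per token with the mask logits, which is the kind of dynamic sparsity the abstract describes; skipping the masked experts is where the FLOP savings would come from.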

Juntong Wu, Jialiang Cheng, Qishen Yin, Yue Dai, Yuliang Yan, Fuyu Lv, Ou Dan, Li Yuan • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Code Generation | HumanEval | -- | 1043 |
| General Knowledge | MMLU | MMLU General Knowledge Accuracy: 80.09 | 307 |
| Reading Comprehension | BoolQ | Accuracy (BoolQ): 88.07 | 228 |
| Question Answering | BoolQ | Accuracy: 88.07 | 201 |
| Commonsense Reasoning | CSQA | CSQA Accuracy: 86.4 | 195 |
| Commonsense Question Answering | CSQA | Accuracy: 86.4 | 71 |
| Language Understanding | CMMLU | Accuracy: 81.53 | 62 |
| Commonsense Question Answering | CSQA | Accuracy: 70.93 | 61 |
| General Knowledge | CMMLU | Accuracy: 81.53 | 50 |
| Mathematical Reasoning | MATH | -- | 46 |
Showing 10 of 18 rows
