Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

About

The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the Straggler Effect, as the most burdened experts dictate the overall inference latency. To address this, we first propose Capacity-Aware Token Drop, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., 30% speedup with only 0.9% degradation on OLMoE). Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce Capacity-Aware Expanded Drop, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency. For example, applying Expanded Drop to Mixtral-8×7B-Instruct yields a 0.2% average performance improvement and a 1.85× inference speedup. The code is released at: https://github.com/CASE-Lab-UMD/Capacity-Aware-MoE.
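The idea behind Capacity-Aware Token Drop can be illustrated with a minimal Python sketch. This is not the released implementation. It assumes that an expert's capacity is a hypothetical `capacity_factor` times the mean load per expert, and that the tokens discarded from an overloaded expert are those with the lowest gating scores; the abstract specifies neither choice. The function names and the toy routing data are likewise illustrative only.

```python
import math
from collections import defaultdict


def capacity_aware_token_drop(assignments, num_experts, capacity_factor=1.0):
    """Cap every expert's load and discard the excess assignments.

    assignments: list of (token_id, expert_id, gate_score) tuples,
                 i.e. the routing decisions after top-k gating.
    Returns (kept, dropped), two lists of the same tuples.
    """
    # Assumption: capacity = capacity_factor * mean number of assignments per expert.
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)

    per_expert = defaultdict(list)
    for token_id, expert_id, score in assignments:
        per_expert[expert_id].append((token_id, expert_id, score))

    kept, dropped = [], []
    for expert_id in range(num_experts):
        # Assumption: keep the highest-scoring tokens, drop the rest.
        ranked = sorted(per_expert[expert_id], key=lambda a: a[2], reverse=True)
        kept.extend(ranked[:capacity])
        dropped.extend(ranked[capacity:])
    return kept, dropped


def max_load(assignments, num_experts):
    """Load of the busiest expert, which sets the latency under expert parallelism."""
    loads = [0] * num_experts
    for _, expert_id, _ in assignments:
        loads[expert_id] += 1
    return max(loads)


# Toy routing: expert 0 is heavily overloaded, expert 3 receives nothing.
toy = [
    (0, 0, 0.9), (1, 0, 0.8), (2, 0, 0.7), (3, 0, 0.6), (4, 0, 0.5), (5, 0, 0.4),
    (6, 1, 0.9), (7, 2, 0.9),
]
kept, dropped = capacity_aware_token_drop(toy, num_experts=4, capacity_factor=1.5)
print("max load before:", max_load(toy, 4))
print("max load after: ", max_load(kept, 4))
print("dropped tokens: ", sorted(t for t, _, _ in dropped))
```

In this toy case the straggler's load falls from six assignments to the capacity of three, while expert 3 stays idle. That leftover slack on low-load experts is what Capacity-Aware Expanded Drop is meant to use, and this sketch does not attempt it.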

Shwai He, Weilin Cai, Jiayi Huang, Ang Li • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K (test)	Accuracy 64.7	954
Video Understanding	MVBench	Accuracy 68.74	635
Natural Language Inference	RTE	Accuracy 73.3	590
Multimodal Understanding	SEED-Bench	--	571
Code Generation	HumanEval	pass@1 50.8	329
Long Video Understanding	LongVideoBench (val)	Accuracy 61.22	282
Video Understanding	VideoMME	--	222
Video Understanding	EgoSchema	--	185
Chart Understanding	ChartQA	Accuracy 83.91	159
Real-world Visual Understanding	RealworldQA	Accuracy 69.46	110

Showing 10 of 23 rows
