ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

About

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, so expert selection requires gradient estimators whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass and, in the backward pass, uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.
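
To make the phrase "exact marginal probability" concrete, the sketch below computes per-expert marginals for one possible distribution over $k$-expert subsets. The abstract does not state the form of the distribution, so the choice here is an assumption made only for illustration: each subset $S$ of size $k$ has probability proportional to the product of exp(logit_i) over the experts in $S$. The function names (`elementary_symmetric`, `exact_k_marginals`, `brute_force_marginals`) are hypothetical and are not taken from the paper or any released code.

```python
import math
from itertools import combinations


def elementary_symmetric(weights, k):
    """Return [e_0, ..., e_k], the elementary symmetric polynomials of weights."""
    e = [1.0] + [0.0] * k
    for w in weights:
        # Go from high to low order so each weight enters a product at most once.
        for j in range(k, 0, -1):
            e[j] += w * e[j - 1]
    return e


def exact_k_marginals(logits, k):
    """P(expert i is selected) under the ASSUMED model
    p(S) proportional to prod_{i in S} exp(logit_i), with |S| = k."""
    shift = max(logits)
    # Shifting the logits rescales every k-subset by the same factor,
    # so the distribution is unchanged while exp() stays bounded.
    weights = [math.exp(x - shift) for x in logits]
    normalizer = elementary_symmetric(weights, k)[k]
    marginals = []
    for i, w in enumerate(weights):
        rest = weights[:i] + weights[i + 1:]
        # Total weight of all k-subsets that contain expert i.
        with_i = w * elementary_symmetric(rest, k - 1)[k - 1]
        marginals.append(with_i / normalizer)
    return marginals


def brute_force_marginals(logits, k):
    """Same quantity by enumerating every k-subset (small n only)."""
    n = len(logits)
    totals = [0.0] * n
    normalizer = 0.0
    for subset in combinations(range(n), k):
        weight = math.exp(sum(logits[i] for i in subset))
        normalizer += weight
        for i in subset:
            totals[i] += weight
    return [t / normalizer for t in totals]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]  # illustrative router logits, not from the paper
    print(exact_k_marginals(logits, k=2))
    print(brute_force_marginals(logits, k=2))
```

Under this assumed model the marginals always sum to $k$, and each one is a smooth function of the logits, which is the property that would let a backward pass differentiate through them even though the sampled subset itself is discrete. The per-expert loop costs O(n^2 k) as written; it favors readability over speed.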

Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	gsm	Accuracy 53.29	70
Code Generation	MBPP	Average Score 35	30
Legal Reasoning	Law	LLM-as-Judge Score 34.4	13
Machine Translation	Translation	LLM-as-Judge Score 39.23	13
Text Summarization	Summary	LLM-as-Judge Score 44.4	13
Multi-task Language Understanding	MMLU	Overall Accuracy 61.05	10
