
Path-Constrained Mixture-of-Experts

About

Sparse Mixture-of-Experts (MoE) architectures route each token through a subset of experts, with each layer making its routing decision independently. We propose viewing MoE computation through the lens of "expert paths": the sequence of expert selections a token makes across all layers. This perspective reveals that, despite N^L possible paths for N experts across L layers, tokens in practice cluster into a small fraction of paths that align with linguistic function, while the vast majority of paths remain unexplored, a statistical inefficiency. This motivates architectures that constrain the effective path space to amplify this natural concentration. As one instantiation, we introduce PathMoE, which shares router parameters across blocks of consecutive layers. Analysis confirms that PathMoE amplifies the emergent path structure: it produces more concentrated path clusters, better cross-layer consistency, and greater robustness to routing perturbations. Experiments on 0.9B- and 16B-parameter PathMoE models demonstrate consistent improvements in perplexity and on downstream tasks over independent routing, while eliminating the need for auxiliary losses. These results establish expert paths as a useful design axis for MoE architectures, complementary to existing work on independent routing mechanisms.
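The block-shared routing idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all names (`router_index`, `route`, the block size, the toy router weights) are assumptions. Layers in the same block of size B reuse one router, so the number of distinct expert paths shrinks from N**L to N**ceil(L/B).

```python
import math


def router_index(layer: int, block_size: int) -> int:
    """Map a layer to the shared router used by its block of consecutive layers."""
    return layer // block_size


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, routers, layer, block_size, top_k=1):
    """Pick top-k expert indices for `token` at `layer`.

    `routers` holds one weight matrix per block (rows = experts), so every
    layer in a block produces routing logits from the same parameters.
    """
    weights = routers[router_index(layer, block_size)]
    logits = [sum(w * x for w, x in zip(row, token)) for row in weights]
    probs = softmax(logits)
    return sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]


# Toy setup: N=2 experts, L=4 layers, block size B=2 -> layers 0,1 share
# router 0 and layers 2,3 share router 1, giving 2**2 = 4 effective paths
# instead of 2**4 = 16 under fully independent per-layer routing.
routers = [
    [[1.0, 0.0], [0.0, 1.0]],  # router for block 0 (layers 0-1)
    [[0.0, 1.0], [1.0, 0.0]],  # router for block 1 (layers 2-3)
]
path = [route([1.0, 0.0], routers, layer, block_size=2)[0] for layer in range(4)]
```

Here the token's path is identical within each block by construction, which is the concentration effect the shared router enforces.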

Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly • 2026

Related benchmarks

Task | Dataset | Result | Rank
Language Modeling | Fineweb 100B | Perplexity (PPL): 12.29 | 9
Zero-shot Commonsense Reasoning | Reasoning Suite Zero-shot (ARC-E, BoolQ, HSwag, LAMBADA, OBQA, PIQA, SocIQA, WinoGr.) | ARC-E Accuracy: 45.5 | 9
Zero-shot Reasoning and Knowledge | DCLM-Pro | ARC-E Accuracy: 58.84 | 4
Zero-shot Language Evaluation | DCLM-Pro | WinoGrande: 57.93 | 2
