
SD-MoE: Spectral Decomposition for Effective Expert Specialization

About

Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting the effective capacity and model performance. In this work, we analyze MoE from a spectral perspective on parameter and gradient spaces and uncover that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by the ubiquitous low-rank structure of human corpora, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameters and gradients in the spectral space. SD-MoE improves performance across downstream tasks, enables effective expert specialization, incurs minimal additional computation, and can be seamlessly integrated into a wide range of existing MoE architectures, including Qwen and DeepSeek.
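The sketch below is not the authors' implementation; it only illustrates the kind of spectral diagnostic the abstract describes, measuring how strongly the dominant singular subspaces of two experts' weight matrices overlap. The matrices here are random stand-ins; in a real MoE one would load, e.g., the up- or down-projection weights of each expert. All names and dimensions are illustrative assumptions.

```python
# Minimal sketch of a dominant-subspace overlap diagnostic between two experts.
# NOTE: toy random weights stand in for real expert parameters.
import torch

def dominant_subspace(weight: torch.Tensor, k: int) -> torch.Tensor:
    """Return the top-k left singular vectors (dominant spectral components)."""
    u, _, _ = torch.linalg.svd(weight, full_matrices=False)
    return u[:, :k]                       # shape: (out_dim, k)

def subspace_overlap(u_a: torch.Tensor, u_b: torch.Tensor) -> float:
    """Mean squared cosine of principal angles between two k-dim subspaces.
    1.0 = identical dominant subspaces, ~k/out_dim = random/orthogonal."""
    # Singular values of U_a^T U_b are the cosines of the principal angles.
    cosines = torch.linalg.svdvals(u_a.T @ u_b)
    return float((cosines ** 2).mean())

torch.manual_seed(0)
d_out, d_in, k = 512, 256, 16             # hypothetical dimensions and rank

# Stand-ins for two experts' weight matrices.
expert_a = torch.randn(d_out, d_in)
expert_b = torch.randn(d_out, d_in)

overlap = subspace_overlap(dominant_subspace(expert_a, k),
                           dominant_subspace(expert_b, k))
print(f"dominant-subspace overlap between experts: {overlap:.3f}")
```

For independent random matrices this overlap stays near the chance level k/d_out; the abstract's finding (1) corresponds to trained experts showing values well above that baseline.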

Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Chun Zhang, Li Shang • 2026

Related benchmarks

Task                  | Dataset        | Metric   | Result | Rank
Commonsense Reasoning | HellaSwag      | Accuracy | 66.13  | 1460
Question Answering    | ARC Challenge  | Accuracy | 40.7   | 749
Commonsense Reasoning | PIQA           | Accuracy | 76.39  | 647
Question Answering    | ARC Easy       | Accuracy | 71.84  | 386
Commonsense Reasoning | WinoGrande     | Accuracy | 62.43  | 231
Reading Comprehension | RACE           | Accuracy | 51.48  | 151
Commonsense Reasoning | SIQA           | Accuracy | 41.91  | 96
Word Prediction       | Lambada OpenAI | Accuracy | 47.29  | 6
