Sparse Backpropagation for MoE Training

About

One defining characteristic of Mixture-of-Experts (MoE) models is their capacity for sparse computation via expert routing, which leads to remarkable scalability. However, backpropagation, the cornerstone of deep learning, requires dense computation, thereby posing challenges for MoE gradient computation. Here, we introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing. Unlike typical MoE training, which strategically neglects certain gradient terms for the sake of sparse computation and scalability, SparseMixer provides scalable gradient approximations for these terms, enabling reliable gradient estimation in MoE training. Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations with negligible computational overhead. Applied to Switch Transformer on both pre-training and machine translation tasks, SparseMixer shows considerable performance gains, accelerating training convergence by up to 2 times.
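The abstract describes the mid-point method as a second-order ODE solver. As a rough illustration of what that means, the sketch below compares one generic mid-point integrator against forward Euler on the toy problem dy/dt = y. It is not SparseMixer's gradient estimator, and every name in it (`euler_step`, `midpoint_step`, `integrate`, `growth`) is hypothetical, chosen only for this example.

```python
import math

def euler_step(f, t, y, h):
    """Forward Euler (first-order): follow the slope at the start of the step."""
    return y + h * f(t, y)

def midpoint_step(f, t, y, h):
    """Mid-point method (second-order): follow the slope at the step's midpoint."""
    y_mid = y + 0.5 * h * f(t, y)          # half step to estimate the midpoint state
    return y + h * f(t + 0.5 * h, y_mid)   # full step using the midpoint slope

def integrate(step, f, y0, t0, t1, n):
    """Apply `step` n times with a fixed step size to go from t0 to t1."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = step(f, t, y, h)
        t += h
    return y

def growth(t, y):
    """Toy ODE dy/dt = y, whose exact solution is y(t) = exp(t)."""
    return y

exact = math.e  # y(1) for y(0) = 1
euler_error = abs(integrate(euler_step, growth, 1.0, 0.0, 1.0, 10) - exact)
midpoint_error = abs(integrate(midpoint_step, growth, 1.0, 0.0, 1.0, 10) - exact)
print(f"Euler error:     {euler_error:.4f}")
print(f"Mid-point error: {midpoint_error:.4f}")
```

With the same number of steps, the mid-point integrator costs one extra function evaluation per step and leaves a much smaller error on this toy problem, which is the kind of accuracy-for-cost trade-off the abstract appeals to.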

Liyuan Liu, Jianfeng Gao, Weizhu Chen • 2023

Related benchmarks

Task	Dataset	Metric	Result	
Commonsense Reasoning	HellaSwag	Accuracy	30.24	1896
Question Answering	ARC Challenge	Accuracy	19.8	906
Commonsense Reasoning	PIQA	Accuracy	62.89	757
Question Answering	ARC-E	Accuracy	46.72	544
Language Modeling	LAMBADA	Accuracy	34.12	412
Reading Comprehension	BoolQ	Accuracy	45.96	279
Reading Comprehension	RACE	Accuracy	29	151
Mathematical Reasoning	gsm	Accuracy	1.3	70
Code Generation	MBPP	Average Score	0.00e+0	30
Legal Reasoning	Law	LLM-as-judge Score	3.4	13

Showing 10 of 12 rows
