On the Representation Collapse of Sparse Mixture of Experts
About
Sparse mixture of experts provides larger model capacity while keeping the computational cost roughly constant. It employs a routing mechanism that distributes input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis of the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
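The core idea — scoring tokens against experts on a low-dimensional hypersphere rather than in the full hidden space — can be sketched as follows. This is a minimal illustration, not the authors' code: the class name, the routing dimension, and the learned-temperature initialization are assumptions; the essential ingredients (a down-projection, L2 normalization of both token and expert vectors, and cosine-similarity scores) follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereRouter(nn.Module):
    """Sketch of low-dimensional hypersphere routing for a sparse MoE layer.

    Tokens are projected to a small routing space and L2-normalized, as are
    the learnable expert embeddings, so routing scores are cosine
    similarities on the unit hypersphere (scaled by a learned temperature).
    """

    def __init__(self, d_model: int, d_route: int, n_experts: int,
                 init_temp: float = 0.07):
        super().__init__()
        # Down-projection into the low-dimensional routing space.
        self.proj = nn.Linear(d_model, d_route, bias=False)
        # One learnable embedding (centroid) per expert.
        self.expert_emb = nn.Parameter(torch.randn(n_experts, d_route))
        # Learned temperature, stored in log space for positivity.
        self.log_temp = nn.Parameter(torch.log(torch.tensor(init_temp)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token representations.
        h = F.normalize(self.proj(x), dim=-1)        # unit-norm token vectors
        e = F.normalize(self.expert_emb, dim=-1)     # unit-norm expert vectors
        scores = h @ e.t() / self.log_temp.exp()     # scaled cosine similarity
        # Routing probabilities over experts for each token.
        return F.softmax(scores, dim=-1)
```

Because both sides are normalized, the score magnitude no longer depends on token norm, which is what decouples routing from the scale of the hidden representations.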
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 69.7 | 466 |
| Natural Language Inference | XNLI (test) | Average Accuracy | 34.5 | 167 |
| Visual Question Answering | VQAv2 (test-dev) | Accuracy | 68.4 | 76 |
| Image Captioning | COCO Caption | CIDEr | 122.9 | 55 |
| Visual Reasoning | NLVR2 (test-p) | Accuracy | 76.1 | 21 |
| Visual Reasoning | NLVR2 (dev) | Accuracy | 75.5 | 16 |
| Language Understanding | LLM Evaluation Harness | ARC-Challenge Acc | 19.4 | 7 |
| English-focused language modeling | English-focused language modeling (val) | Perplexity | 11.96 | 6 |
| Masked multi-modal modeling | Masked multi-modal modeling (val) | Perplexity | 12.68 | 6 |
| Multi-lingual language modeling | Multi-lingual language modeling (val) | Perplexity | 6.02 | 6 |