On the Representation Collapse of Sparse Mixture of Experts

About

Sparse mixture of experts provides larger model capacity while requiring only a constant computational overhead. It employs a routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis of the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
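
As a rough illustration of the routing idea described in the abstract, the sketch below scores tokens against experts by projecting hidden states into a low-dimensional space, L2-normalizing both the projections and the expert embeddings, and taking temperature-scaled cosine similarities. This is a minimal sketch, not the authors' exact implementation; the class name, dimension defaults, and `temperature` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class CosineRouter(torch.nn.Module):
    """Sketch of low-dimensional hypersphere routing for a sparse MoE layer.

    Token hidden states are projected into a low-dimensional space and
    L2-normalized; expert embeddings live on the same unit hypersphere.
    Routing scores are scaled cosine similarities. Names and defaults
    here are assumptions, not the paper's exact configuration.
    """

    def __init__(self, hidden_dim: int, num_experts: int,
                 routing_dim: int = 16, temperature: float = 0.07):
        super().__init__()
        # Project tokens from the full hidden space to a small routing space.
        self.proj = torch.nn.Linear(hidden_dim, routing_dim, bias=False)
        # One learnable embedding per expert in the low-dimensional space.
        self.expert_embed = torch.nn.Parameter(torch.randn(num_experts, routing_dim))
        self.temperature = temperature

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (num_tokens, hidden_dim)
        tokens = F.normalize(self.proj(hidden), dim=-1)   # unit-norm token projections
        experts = F.normalize(self.expert_embed, dim=-1)  # unit-norm expert embeddings
        # Cosine similarity between every token and every expert,
        # sharpened by the temperature before the softmax over experts.
        scores = tokens @ experts.t() / self.temperature
        return F.softmax(scores, dim=-1)                  # (num_tokens, num_experts)
```

Each token would then be dispatched to its top-1 (or top-k) expert according to these probabilities. The intuition is that, because similarities are measured on a unit hypersphere in a separate low-dimensional space, the router exerts less pressure on the full hidden representations to cluster around expert centroids.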

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 (test-std) | Accuracy: 69.7 | 466 |
| Natural Language Inference | XNLI (test) | Average Accuracy: 34.5 | 167 |
| Visual Question Answering | VQA v2 (test-dev) | Accuracy: 68.4 | 76 |
| Image Captioning | COCO Caption | CIDEr: 122.9 | 55 |
| Visual Reasoning | NLVR2 (test-p) | Accuracy: 76.1 | 21 |
| Visual Reasoning | NLVR2 (dev) | Accuracy: 75.5 | 16 |
| Language Understanding | LLM Evaluation Harness | ARC-Challenge Accuracy: 19.4 | 7 |
| English-focused language modeling | English-focused language modeling (val) | Perplexity: 11.96 | 6 |
| Masked multi-modal modeling | Masked multi-modal modeling (val) | Perplexity: 12.68 | 6 |
| Multi-lingual language modeling | Multi-lingual language modeling (val) | Perplexity: 6.02 | 6 |

Showing 10 of 11 rows.
