On the Representation Collapse of Sparse Mixture of Experts
About
Sparse mixture of experts provides larger model capacity while keeping the computational cost roughly constant. It employs a routing mechanism that distributes input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis of the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
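The core idea — scoring tokens against experts on a low-dimensional hypersphere rather than in the full hidden space — can be sketched as follows. This is a minimal illustration, not the authors' code: the class name, the routing dimension, and the learned-temperature initialization are assumptions; the essential ingredients (a down-projection, L2 normalization of both token and expert vectors, and cosine-similarity scores) follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereRouter(nn.Module):
    """Sketch of low-dimensional hypersphere routing for a sparse MoE layer.

    Tokens are projected to a small routing space and L2-normalized, as are
    the learnable expert embeddings, so routing scores are cosine
    similarities on the unit hypersphere (scaled by a learned temperature).
    """

    def __init__(self, d_model: int, d_route: int, n_experts: int,
                 init_temp: float = 0.07):
        super().__init__()
        # Down-projection into the low-dimensional routing space.
        self.proj = nn.Linear(d_model, d_route, bias=False)
        # One learnable embedding (centroid) per expert.
        self.expert_emb = nn.Parameter(torch.randn(n_experts, d_route))
        # Learned temperature, stored in log space for positivity.
        self.log_temp = nn.Parameter(torch.log(torch.tensor(init_temp)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token representations.
        h = F.normalize(self.proj(x), dim=-1)        # unit-norm token vectors
        e = F.normalize(self.expert_emb, dim=-1)     # unit-norm expert vectors
        scores = h @ e.t() / self.log_temp.exp()     # scaled cosine similarity
        # Routing probabilities over experts for each token.
        return F.softmax(scores, dim=-1)
```

Because both sides are normalized, the score magnitude no longer depends on token norm, which is what decouples routing from the scale of the hidden representations.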
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 69.7 | 466 |
| Natural Language Inference | XNLI (test) | Average Accuracy | 34.5 | 167 |
| Visual Question Answering | VQAv2 (test-dev) | Accuracy | 68.4 | 76 |
| Image Captioning | COCO Caption | CIDEr | 122.9 | 55 |
| Visual Reasoning | NLVR2 (test-p) | Accuracy | 76.1 | 21 |
| Visual Reasoning | NLVR2 (dev) | Accuracy | 75.5 | 16 |
| Language Understanding | LLM Evaluation Harness | ARC-Challenge Acc | 19.4 | 7 |
| English-focused language modeling | English-focused language modeling (val) | Perplexity | 11.96 | 6 |
| Masked multi-modal modeling | Masked multi-modal modeling (val) | Perplexity | 12.68 | 6 |
| Multi-lingual language modeling | Multi-lingual language modeling (val) | Perplexity | 6.02 | 6 |