DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

About

The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to $128$ tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over $22\%$ improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
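To make the non-smoothness of Top-k concrete, here is a minimal sketch in plain Python. It assumes one common formulation of Top-k gating (keep the k largest logits, apply softmax over them, set the rest to zero). The function names `softmax` and `top_k_gate` and the example logits are illustrative and are not taken from the paper's implementation. DSelect-k itself is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Hypothetical Top-k gate: softmax over the k largest logits, zero elsewhere.

    This is an assumed textbook form of Top-k gating, not the paper's code.
    """
    # Indices of the k largest logits (ties broken by lower index).
    keep = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    weights = softmax([logits[i] for i in keep])
    gate = [0.0] * len(logits)
    for i, w in zip(keep, weights):
        gate[i] = w
    return gate


# Two nearly identical logit vectors in which experts 1 and 2 swap rank.
eps = 1e-6
before = top_k_gate([2.0, 1.0, 1.0 - eps, -1.0], k=2)
after = top_k_gate([2.0, 1.0, 1.0 + eps, -1.0], k=2)

# The input moved by only 2 * eps, yet the weight on expert 1 drops to zero.
jump = abs(before[1] - after[1])
print(before)
print(after)
print(jump)
```

In this sketch a perturbation of order 1e-6 in one logit moves a gate weight from roughly 0.27 to exactly 0, which is the kind of discontinuity that a continuously differentiable gate is meant to avoid.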

Hussein Hazimeh, Zhe Zhao, Aakanksha Chowdhery, Maheswaran Sathiamoorthy, Yihua Chen, Rahul Mazumder, Lichan Hong, Ed H. Chi • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-task Regression | MovieLens (test) | Loss: 3.68e+3 | 21 |
| Multi-task Learning | NYU V2 | mIoU: 53.75 | 19 |
| Multi-task Learning (Segmentation, Part Segmentation, Disparity) | Cityscapes | Semantic Segmentation mIoU: 69.67 | 16 |
| Multi-task image classification | Multi-Fashion MNIST (test) | Accuracy 1: 83.78 | 7 |
| Multi-task image classification | Multi-MNIST (test) | Task 1 Accuracy: 92.56 | 7 |
| Engagement Task 1 | Real-world large-scale content recommender system (out-of-sample) | AUC: 81.03 | 2 |
| Engagement Task 2 | Real-world large-scale content recommender system (out-of-sample) | AUC: 81.61 | 2 |
| Engagement Task 3 | Real-world large-scale content recommender system (out-of-sample) | RMSE: 0.2874 | 2 |
| Engagement Task 4 | Real-world large-scale content recommender system (out-of-sample) | RMSE: 0.8781 | 2 |
| Engagement Task 5 | Real-world large-scale content recommender system (out-of-sample) | AUC: 75.24 | 2 |
Showing 10 of 13 rows
