Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation

About

Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation. However, MoE models severely over-fit on low-resource tasks. We show effective regularization strategies, namely dropout techniques for MoE layers (EOM and FOM), Conditional MoE Routing, and Curriculum Learning methods, that prevent over-fitting and improve the performance of MoE models on low-resource tasks without adversely affecting high-resource tasks. On a massively multilingual machine translation benchmark, our strategies result in about a +1 chrF++ improvement on very low-resource language pairs. We perform an extensive analysis of the learned MoE routing to better understand the impact of our regularization methods and how we can improve them.
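The EOM and FOM dropout variants mentioned in the abstract can be thought of as token-level dropout applied at different points inside an MoE layer. The PyTorch sketch below illustrates that reading and is not the paper's implementation: `ToyMoELayer`, `p_eom`, and `p_fom` are hypothetical names, top-2 routing is assumed, EOM is taken to mask each routed expert's output independently per token, and FOM to mask the combined MoE output; Conditional MoE Routing and the curriculum schedule are omitted.

```python
# Illustrative sketch (not the authors' code) of EOM/FOM-style token dropout
# in a toy top-2 gated MoE feed-forward layer. All names and the exact
# masking placement are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoELayer(nn.Module):
    """Toy top-2 gated MoE feed-forward layer with EOM/FOM-style token dropout."""

    def __init__(self, d_model, n_experts, p_eom=0.2, p_fom=0.2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)
        self.p_eom = p_eom
        self.p_fom = p_fom

    def forward(self, x, mode="eom"):
        # x: (n_tokens, d_model); top-2 routing, no capacity limits, for brevity.
        gate = F.softmax(self.router(x), dim=-1)        # (n_tokens, n_experts)
        top_w, top_idx = gate.topk(2, dim=-1)           # (n_tokens, 2)

        combined = torch.zeros_like(x)
        for k in range(2):                              # the two routed experts
            slot_out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                sel = top_idx[:, k] == e
                if sel.any():
                    slot_out[sel] = expert(x[sel])

            if self.training and mode == "eom":
                # EOM (as read here): drop this expert's output for a random
                # subset of tokens, independently for each routed expert.
                keep = (torch.rand(x.size(0), 1, device=x.device) >= self.p_eom).float()
                slot_out = slot_out * keep

            combined = combined + top_w[:, k:k + 1] * slot_out

        if self.training and mode == "fom":
            # FOM (as read here): drop the combined MoE output for a random
            # subset of tokens instead.
            keep = (torch.rand(x.size(0), 1, device=x.device) >= self.p_fom).float()
            combined = combined * keep

        return combined  # the Transformer block's residual connection is applied outside


# Illustrative usage on random data.
layer = ToyMoELayer(d_model=16, n_experts=4)
layer.train()
out = layer(torch.randn(8, 16), mode="eom")
```

Under this reading, a token whose expert output is masked falls back to the surrounding residual connection, which is what regularizes the experts on low-resource data.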

Maha Elbayad, Anna Sun, Shruti Bhosale • 2022

Related benchmarks

Task | Dataset | Result | Rank
Machine Translation | MMT eng-xx (all) | chrF++ 55.1 | 12
Machine Translation | MMT eng-xx high-resource | chrF++ 64.7 | 12
Machine Translation | MMT eng-xx very-low-resource | chrF++ 49.2 | 12
Multilingual Machine Translation | OPUS-16 XX → En | Score (High Tier) 29.84 | 10
Multilingual Machine Translation | OPUS-16 En → XX | Score (High Tier) 26.52 | 10
Machine Translation | MMT eng-xx low-resource | chrF++ 41.8 | 6
Machine Translation | MMT xx-eng low-resource | chrF++ 51.5 | 6
Machine Translation | MMT xx-yy (all) | chrF++ 42.8 | 6
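The chrF++ numbers above are the standard character n-gram F-score extended with word bigrams, commonly computed with sacrebleu's CHRF metric. Below is a minimal sketch; the toy hypothesis/reference pair is illustrative, and the benchmark's actual test sets and evaluation settings may differ.

```python
# Minimal sketch of scoring with chrF++, the metric reported in the table above.
# Assumes the sacrebleu package; the example sentences are illustrative only.
from sacrebleu.metrics import CHRF

hypotheses = ["The cat sat on the mat ."]
references = [["The cat is sitting on the mat ."]]  # one reference set

chrf_pp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++
print(chrf_pp.corpus_score(hypotheses, references))
```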
