Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

About

We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-Instruct variants, which surpass the performance of competing models possessing 3× to 14× more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.

Fan Jiang, Yu Zhao, Chenyang Lyu, Tianqi Shi, Yichao Du, Feihu Jiang, Longyue Wang, Weihua Luo • 2026

Related benchmarks

Task | Dataset | Result | Rank
Multilingual Mathematical Reasoning | MGSM | -- | 52
Multitask Language Understanding | GlobalMMLU | Accuracy 73.3 | 18
