
ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

About

The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang • 2026
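The storage benefit the abstract attributes to Matryoshka Representation Learning (MRL) comes from the general MRL idea that a prefix of a trained embedding can stand in for the full vector. The following is a minimal sketch of that idea only, assuming the usual "keep the first k dimensions, then re-normalize" usage. The function names and toy vectors are hypothetical and are not taken from the ML-Embed release, and the sketch does not cover MLL or MEL.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result.

    Hypothetical helper illustrating MRL-style prefix truncation;
    not part of the ML-Embed code.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # avoid dividing by zero for an all-zero prefix
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional "embeddings" (made-up values, for illustration only).
full_a = [3.0, 4.0, 12.0, 0.0]
full_b = [4.0, 3.0, 0.0, 12.0]

# Store only the first 2 dimensions of each vector: half the storage.
short_a = truncate_embedding(full_a, 2)
short_b = truncate_embedding(full_b, 2)

print(len(short_a), cosine(short_a, short_b))
```

How much retrieval quality survives truncation depends on how the model was trained, so the truncated dimension should be chosen by evaluation rather than assumed.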

Related benchmarks

Task           | Dataset     | Result                                              | Rank
Text Embedding | MTEB (test) | Average Score: 64.69                                | 14
Text Embedding | MTEB        | Average (Multi-Language/Domain) Performance: 66.79 | 12
