
BYOL: Bring Your Own Language Into LLMs

About

Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised fine-tuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol.
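The weight-space model merging mentioned above, used to preserve English and multilingual capabilities after language-specific adaptation, can be sketched as a linear interpolation between the base model's parameters and the adapted model's parameters. The function name, the dict-of-lists parameter representation, and the interpolation weight `alpha` below are illustrative assumptions, not the exact recipe from the BYOL paper:

```python
def merge_state_dicts(base, adapted, alpha=0.5):
    """Linearly interpolate two models' parameters, key by key.

    base, adapted: dicts mapping parameter names to lists of floats
    (a stand-in for framework tensors). alpha is the weight on the
    adapted model: merged = (1 - alpha) * base + alpha * adapted.
    Assumes both models share the same architecture and parameter names.
    """
    merged = {}
    for name, w_base in base.items():
        w_adapted = adapted[name]
        merged[name] = [
            (1 - alpha) * b + alpha * a
            for b, a in zip(w_base, w_adapted)
        ]
    return merged


# Toy usage: alpha=0.5 averages the two parameter sets.
base = {"layer.weight": [0.0, 2.0]}
adapted = {"layer.weight": [2.0, 4.0]}
print(merge_state_dicts(base, adapted, alpha=0.5))
```

With real checkpoints the same per-parameter interpolation would run over framework tensors (e.g. the entries of a PyTorch `state_dict`); setting `alpha` closer to 1 favors the language-adapted model, closer to 0 favors the base model's original capabilities.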

Syed Waqas Zamir, Wassim Hamidouche, Boulbaba Ben Amor, Luana Marotti, Inbal Becker-Reshef, Juan Lavista Ferres • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 51.89 | 1460 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 64.96 | 329 |
| Mathematical Reasoning | MGSM | Accuracy | 53.2 | 114 |
| Science Question Answering | ARC Easy | Accuracy | 43.73 | 101 |
| Commonsense Reasoning | ARC-E | Accuracy | 51.14 | 62 |
| Causal Reasoning | XCOPA | Accuracy | 71.2 | 33 |
| Commonsense Reasoning | HellaSwag | HellaSwag Score | 38.11 | 27 |
| Commonsense Reasoning | XCOPA | Accuracy | 61.2 | 24 |
| Reading Comprehension | Belebele | Accuracy | 61 | 20 |
| Story Completion | XStoryCloze | Accuracy | 67.9 | 20 |

(Showing 10 of 35 rows)
