ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
About
Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To address this imbalance and bias, we propose a novel supervised contrastive learning (SCL) approach that learns domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining performance on high-resource languages.
Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut • 2025
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Identification | AfroScope High resource | Macro-F1 | 98.37 | 16 |
| Language Identification | AfroScope-Data Mid resource | Macro-F1 | 91.42 | 8 |
| Language Identification | UDHR 360 languages (out-of-domain) | F1 | 90.6 | 7 |
| Language Identification | FLORES-200 199 languages (test) | F1 | 97.16 | 7 |
| Language Identification | GlotLID-C 2099 languages (test) | F1 | 98.68 | 5 |
| Language Identification | Afroscope | Macro-F1 | 87.17 | 5 |
| Language Identification | BLOOM | Macro-F1 | 87.95 | 5 |
| Language Identification | FineWeb2 | Macro-F1 | 89.03 | 5 |
| Language Identification | Mafand | Macro-F1 | 85.43 | 5 |
| Language Identification | MCS-350 | Macro-F1 | 63.54 | 5 |
*Showing 10 of 15 benchmark rows.*