ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
About
Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.
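To make the core idea concrete, below is a minimal NumPy sketch of a batch-wise supervised contrastive (SupCon) objective of the kind the abstract describes: embeddings sharing a language label are pulled together and pushed away from all other samples, regardless of domain. The function name, temperature value, and masking details are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive (SupCon) loss over a batch of embeddings.

    Samples with the same label (here: language) act as positives for one
    another; all other batch samples act as negatives. Pulling positives
    together across domains encourages domain-invariant representations.
    """
    # L2-normalize so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)

    # Exclude self-similarity from the softmax denominator.
    logits = sim - 1e9 * np.eye(n)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    labels = np.asarray(labels)
    pos_mask = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)

    # Average log-probability over each anchor's positives; anchors with
    # no in-batch positive (singleton classes) are skipped.
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    per_anchor = -(log_prob * pos_mask).sum(axis=1)[valid] / pos_counts[valid]
    return per_anchor.mean()
```

In training, a batch mixing in-domain and out-of-domain sentences per language would drive same-language pairs together under this loss; a tightly clustered batch yields a lower loss than a randomly scattered one.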
Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut • 2025
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Identification | AfroScope High resource | Macro-F1 | 98.37 | 16 |
| Language Identification | AfroScope-Data Mid resource | Macro-F1 | 91.42 | 8 |
| Language Identification | AfroScope | Macro-F1 | 87.17 | 5 |
| Language Identification | BLOOM | Macro-F1 | 87.95 | 5 |
| Language Identification | FineWeb2 | Macro-F1 | 89.03 | 5 |
| Language Identification | Mafand | Macro-F1 | 85.43 | 5 |
| Language Identification | MCS-350 | Macro-F1 | 63.54 | 5 |
| Language Identification | Smol | Macro-F1 | 81.55 | 5 |
| Language Identification | UDHR | Macro-F1 | 82.12 | 5 |
| Language Identification | AfroScope-Data Low resource | Macro-F1 | 100 | 4 |