ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
About
Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To address this imbalance and bias, we propose a novel supervised contrastive learning (SCL) approach that learns domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining performance on high-resource languages.
Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut • 2025
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Identification | AfroScope High resource | Macro-F1 | 98.37 | 16 |
| Language Identification | AfroScope-Data Mid resource | Macro-F1 | 91.42 | 8 |
| Language Identification | UDHR 360 languages (out-of-domain) | F1 | 90.6 | 7 |
| Language Identification | FLORES-200 199 languages (test) | F1 | 97.16 | 7 |
| Language Identification | GlotLID-C 2099 languages (test) | F1 | 98.68 | 5 |
| Language Identification | Afroscope | Macro-F1 | 87.17 | 5 |
| Language Identification | BLOOM | Macro-F1 | 87.95 | 5 |
| Language Identification | FineWeb2 | Macro-F1 | 89.03 | 5 |
| Language Identification | Mafand | Macro-F1 | 85.43 | 5 |
| Language Identification | MCS-350 | Macro-F1 | 63.54 | 5 |
*Showing 10 of 15 benchmark rows.*