An Open Dataset and Model for Language Identification

About

Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, the reliability of which we ensure by auditing a sample from each source and each language manually. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model's performance, both in comparison to existing open models and by language class.

Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield • 2023
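The abstract reports two headline metrics: macro-average F1 and false positive rate. The following is a minimal sketch of how such figures could be computed from gold and predicted language labels. It assumes the false positive rate is computed per language as FP / (FP + TN) and then averaged over languages. The listing above does not spell out the exact definition, so treat that as an assumption. The function name `macro_f1_and_fpr` and the label strings are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def macro_f1_and_fpr(gold, pred):
    """Return (macro-averaged F1, macro-averaged false positive rate).

    Assumption: the per-language FPR is FP / (FP + TN), averaged with
    equal weight per language. This is an illustrative definition only.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Every label seen in either list counts as one class.
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed

    total = len(gold)
    f1_scores, fp_rates = [], []
    for lab in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        f1_scores.append(2 * tp[lab] / denom if denom else 0.0)
        # Negatives are all examples whose gold label is not `lab`.
        negatives = total - (tp[lab] + fn[lab])
        fp_rates.append(fp[lab] / negatives if negatives else 0.0)

    return sum(f1_scores) / len(labels), sum(fp_rates) / len(labels)


gold = ["eng", "eng", "fra", "fra"]
pred = ["eng", "fra", "fra", "fra"]
macro_f1, macro_fpr = macro_f1_and_fpr(gold, pred)
```

Because every language contributes equally to a macro average, a weak score on a single low-resource language lowers the total as much as a weak score on a high-resource one, which is why the metric suits a 201-language evaluation.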

Related benchmarks

Task	Dataset	Metric	Result
Language Identification	FLORES-200 OpenLID 1.0	F1 Score	92.3	8
Language Identification	UDHR OpenLID 1.0	F1 Score	88.1	8
Language Identification	SLIDE	Loose Accuracy	94.81	8
Language Identification	FLORES+ (devtest)	Loose Accuracy	99.97	8
Language Identification	Nordic DSL 50k	Loose Accuracy	94.17	8
Language Identification	FLORES-200 ∩ CLD3 95 languages (test)	F1 Score	98.9	3
Language Identification	FLORES-200 ∩ NLLB 193 languages (test)	F1 Score	95.9	2
Language Identification	FLORES-200 201 languages (test)	F1	92.7	1
