
An Open Dataset and Model for Language Identification

About

Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, the reliability of which we ensure by auditing a sample from each source and each language manually. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model's performance, both in comparison to existing open models and by language class.
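To make the two headline metrics concrete, here is a minimal sketch of how a macro-average F1 score and a false positive rate can be computed from gold and predicted language labels. It is an illustration only, not the authors' evaluation code. The function name `macro_f1_and_fpr` and the toy labels are hypothetical, and averaging the false positive rate per language is an assumption, since the abstract does not say how that figure is aggregated.

```python
from collections import Counter


def macro_f1_and_fpr(gold, pred):
    """Return (macro-average F1, macro-average false positive rate).

    Hypothetical helper for illustration. Each language is treated as its own
    one-vs-rest problem, then the per-language scores are averaged unweighted.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted for a sentence in another language
            fn[g] += 1  # g was missed

    labels = sorted(set(gold) | set(pred))
    total = len(gold)
    f1_scores, fp_rates = [], []
    for lab in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        f1_scores.append(2 * tp[lab] / denom if denom else 0.0)
        # FPR = FP / (number of sentences whose gold label is not this language).
        negatives = total - (tp[lab] + fn[lab])
        fp_rates.append(fp[lab] / negatives if negatives else 0.0)

    return sum(f1_scores) / len(labels), sum(fp_rates) / len(labels)


# Toy data: six sentences in three languages (ISO 639-3 codes).
gold = ["eng", "eng", "fra", "fra", "deu", "deu"]
pred = ["eng", "fra", "fra", "fra", "deu", "eng"]
macro_f1, macro_fpr = macro_f1_and_fpr(gold, pred)
print(f"macro F1 = {macro_f1:.3f}, macro FPR = {macro_fpr:.3f}")
```

Because every language contributes equally to a macro average, a weak score on a lower-resource language lowers the result as much as a weak score on a high-resource one, which is why the metric suits the comparison described above.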

Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield • 2023

Related benchmarks

Task                     Dataset                                    Result                Rank
Language Identification  FLORES-200 OpenLID 1.0                     F1 Score 92.3         8
Language Identification  UDHR OpenLID 1.0                           F1 Score 88.1         8
Language Identification  SLIDE                                      Loose Accuracy 94.81  8
Language Identification  FLORES+ (devtest)                          Loose Accuracy 99.97  8
Language Identification  Nordic DSL 50k                             Loose Accuracy 94.17  8
Language Identification  FLORES-200 ∩ CLD3 95 languages (test)      F1 Score 98.9         3
Language Identification  FLORES-200 ∩ NLLB 193 languages (test)     F1 Score 95.9         2
Language Identification  FLORES-200 201 languages (test)            F1 92.7               1
