What Language is This? Ask Your Tokenizer

About

Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique, and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings (surpassing 70% accuracy with as few as five labeled samples per language), and delivers large gains on fine-grained dialect identification.

Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel • 2026
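
To make the idea in the abstract concrete, here is a minimal sketch of language identification with language-conditional unigram distributions over a shared vocabulary. It is not the authors' implementation. The vocabulary, the counts, and the function names (`train_unigram`, `viterbi_score`, `identify`) are hypothetical, parameter estimation is replaced by simple add-alpha smoothing of toy counts, and scoring uses the best (Viterbi) segmentation per language, which may differ from what UniLID actually does. Characters outside the toy vocabulary are not handled.

```python
import math

# Hypothetical shared vocabulary: a few word pieces plus single characters.
VOCAB = ["the", "cat", "die", "katze"] + list("acdehiktz")

# Invented toy counts per language (illustration only, not real data).
COUNTS = {
    "eng": {"the": 50, "cat": 20},
    "deu": {"die": 50, "katze": 20},
}


def train_unigram(counts, vocab, alpha=1.0):
    """Return log-probabilities over the shared vocab with add-alpha smoothing."""
    total = sum(counts.get(tok, 0) + alpha for tok in vocab)
    return {tok: math.log((counts.get(tok, 0) + alpha) / total) for tok in vocab}


def viterbi_score(word, logp):
    """Log-probability of the best segmentation of `word` under one language."""
    n = len(word)
    max_len = max(len(tok) for tok in logp)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in logp and best[start] > -math.inf:
                best[end] = max(best[end], best[start] + logp[piece])
    return best[n]


def identify(text, models):
    """Score the text under every language model and return the best label."""
    scores = {
        lang: sum(viterbi_score(w, logp) for w in text.split())
        for lang, logp in models.items()
    }
    return max(scores, key=scores.get), scores


# One distribution per language, all defined over the same vocabulary.
MODELS = {lang: train_unigram(c, VOCAB) for lang, c in COUNTS.items()}

label, scores = identify("the cat", MODELS)
print(label, scores)
```

In this sketch each language's distribution is estimated separately, so adding a language amounts to adding one more entry to `MODELS`, and the segmentation of the same input can differ from language to language because each one is searched under its own probabilities.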

Related benchmarks

Task	Dataset	Result
Language Identification	UDHR CLD3 label set 80 languages (test)	F1 Score 0.992	5
Language Identification	FLORES-200 CLD3 label set 77 languages (test)	F1 Score 99.7	5
Language Identification	UDHR Full (366 labels) (test)	F1 Score 85.9	4
Language Identification	GlotLID-C CLD3 label set 83 languages (test)	F1 97.2	4
Language Identification	FLORES-200 Full (190 labels) (test)	F1 Score 93.2	4
Language Identification	GlotLID-C Full (1940 labels) (test)	F1 Score 92.9	3
Language Identification	Tatoeba 201 langs (out-of-domain)	Macro F1 41.4	2
Language Identification	UDHR 142 langs (out-of-domain)	Macro F1 86.8	2
