
Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

About

Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks (POS tagging, dependency tree labeling). In contrast, NER and sentence-level tasks (cross-lingual retrieval, NLI) benefit from shared vocabulary. We also observe that the coverage of language-specific tokens in the multilingual vocabulary significantly impacts the word-level tasks. Our study offers a deeper understanding of the role of tokenizers in multilingual language models and gives guidelines for future model developers to choose the most suitable tokenizer for their specific application before undertaking costly model pre-training.
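To make the notion of vocabulary overlap concrete, here is a minimal sketch of one simple way to quantify it: the Jaccard index between the sets of tokens a tokenizer produces on two languages' corpora. This is an illustrative metric and a toy whitespace tokenizer, not the exact criteria proposed in the paper; a real setup would use a trained sub-word tokenizer (e.g. BPE or SentencePiece).

```python
# Illustrative sketch (not the paper's exact criteria): measure vocabulary
# overlap between two languages as the Jaccard index of their token sets.

def token_vocab(corpus, tokenize):
    """Collect the set of tokens the tokenizer produces on a corpus."""
    vocab = set()
    for sentence in corpus:
        vocab.update(tokenize(sentence))
    return vocab

def vocab_overlap(vocab_a, vocab_b):
    """Jaccard overlap of two token vocabularies (0 = disjoint, 1 = identical)."""
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# Toy tokenizer: lowercased whitespace split, standing in for a trained
# sub-word tokenizer.
toy_tokenize = lambda s: s.lower().split()

# Hypothetical two-language mini-corpora.
en = ["the model is multilingual", "the vocabulary is shared"]
de = ["das Modell ist multilingual", "das Vokabular ist geteilt"]

overlap = vocab_overlap(token_vocab(en, toy_tokenize),
                        token_vocab(de, toy_tokenize))
print(f"Jaccard overlap: {overlap:.3f}")
```

A higher overlap means more tokens are shared between the languages' effective vocabularies; per the paper's findings, whether that sharing helps depends on the downstream task.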

Tomasz Limisiewicz, Jiří Balhar, David Mareček • 2023

Related benchmarks

Task | Dataset | Result | Rank
Named Entity Recognition | NER, averaged over all languages (test) | F1 Score 70.2 | 9
Named Entity Recognition | 20 languages | F1 Score 65.4 | 6
Natural Language Inference | 20 languages | Accuracy 52.3 | 6
Dependency Labeling | 6 languages, same-script split | F1 Score 27.8 | 4
Dependency Labeling | 6 languages, averaged (test) | F1 Score 58.8 | 4
Masked Language Modeling | 6 languages, averaged (test) | MRR 42.7 | 4
Part-of-Speech Tagging | 6 languages, same-script split | F1 Score 41.9 | 4
Part-of-Speech Tagging | 6 languages, averaged (test) | F1 Score 69.2 | 4
Sentence Retrieval | 6 languages, different-script split | Accuracy 0.23 | 4
Sentence Retrieval | 6 languages, all transfers | Accuracy 27.1 | 4
Showing 10 of 46 rows

Other info

Code
