
FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models

About

Using model weights pretrained on a high-resource language as a warm start can reduce the need for data and compute to obtain high-quality language models for other, especially low-resource, languages. However, if we want to use a new tokenizer specialized for the target language, we cannot transfer the source model's embedding matrix. In this paper, we propose FOCUS - Fast Overlapping Token Combinations Using Sparsemax, a novel embedding initialization method that initializes the embedding matrix effectively for a new tokenizer based on information in the source model's embedding matrix. FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary static token embedding space. We focus our study on using the multilingual XLM-R as a source model and empirically show that FOCUS outperforms random initialization and previous work in language modeling and on a range of downstream tasks (NLI, QA, and NER).
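The initialization described above can be illustrated with a minimal sketch in pure Python. The abstract does not give the details, so several points here are assumptions made for illustration: overlapping tokens keep their source embeddings unchanged, similarity is cosine similarity in the auxiliary space, and the weights come from sparsemax over those similarities. The function names (`sparsemax`, `cosine`, `focus_init`) and the toy vectors are hypothetical and are not taken from the authors' code.

```python
import math

def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex.
    Unlike softmax, it can assign exactly zero weight to low scores."""
    ordered = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(ordered, start=1):
        cumsum += z_k
        if 1 + k * z_k > cumsum:      # token k is still inside the support
            tau = (cumsum - 1) / k    # threshold for support size k
    return [max(z - tau, 0.0) for z in scores]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def focus_init(source_emb, aux_emb, target_vocab):
    """source_emb: token -> source model embedding (source vocabulary).
    aux_emb: token -> auxiliary static embedding (target vocabulary).
    Returns token -> initialized embedding for every target token."""
    overlap = [t for t in target_vocab if t in source_emb]
    dim = len(next(iter(source_emb.values())))
    target_emb = {}
    for tok in target_vocab:
        if tok in source_emb:
            # Assumption: overlapping tokens are copied as-is.
            target_emb[tok] = list(source_emb[tok])
            continue
        # New token: weight overlapping tokens by auxiliary similarity.
        sims = [cosine(aux_emb[tok], aux_emb[o]) for o in overlap]
        weights = sparsemax(sims)
        target_emb[tok] = [
            sum(w * source_emb[o][d] for w, o in zip(weights, overlap))
            for d in range(dim)
        ]
    return target_emb

# Toy data (hypothetical): "Katze" is new, "the" and "cat" overlap.
source_emb = {"the": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.5]}
aux_emb = {"the": [1.0, 0.0], "cat": [0.0, 2.0], "Katze": [0.0, 1.0]}
target_emb = focus_init(source_emb, aux_emb, ["the", "cat", "Katze"])
```

In this toy setup, "Katze" is similar only to "cat" in the auxiliary space, so sparsemax places all of its weight on "cat" and the new token starts from that source embedding. With softmax instead, every overlapping token would receive some nonzero weight, which is the practical difference the sparse projection makes.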

Konstantin Dobler, Gerard de Melo • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Question Answering | ARC-E | Accuracy: 48.95 | 523 |
| Natural Language Inference | XNLI | -- | 131 |
| Natural Language Inference | XNLI 1.0 (test) | Accuracy (en): 47 | 40 |
| Cross-lingual retrieval | WebFAQ | nDCG@10: 59.5 | 32 |
| Causal Reasoning | XCOPA (test) | Accuracy (th): 53.8 | 31 |
| Story completion | XStoryCloze 1.0 (test) | XStoryCloze Accuracy (en): 66.8 | 18 |
| Paraphrase Identification | PAWS-X 1.0 (test) | Accuracy (de): 54.4 | 18 |
| Abstractive Summarization | XL-Sum Ukrainian (test) | BLEU Score: 5.16 | 14 |
| General Knowledge | Global MMLU Ukrainian (test) | Accuracy (%): 60.57 | 14 |
| Machine Translation | Long FLORES uk to en (test) | BLEU: 23 | 14 |
Showing 10 of 31 rows
