OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining

About

Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually initializes the embeddings of new subwords randomly and adds substantially more embedding parameters to the model, which reduces efficiency. To address these issues, we propose a novel framework, One For All (OFA), which wisely initializes the embeddings of unseen subwords and can thus adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which greatly reduces the number of parameters. We show that OFA accelerates the convergence of continued pretraining, which is environmentally friendly because it yields a much smaller carbon footprint. Through extensive experiments, we demonstrate that OFA achieves competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
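The two ideas in the abstract, factorizing the embedding matrix and initializing unseen subwords from aligned static vectors, can be illustrated with a minimal sketch. The abstract does not spell out the exact procedure, so everything here is an assumption made for illustration: the function names, the cosine-plus-softmax weighting, the `temperature` parameter, and the example sizes are hypothetical and are not taken from the authors' code.

```python
import math


def embedding_param_counts(vocab_size, hidden_dim, low_dim):
    """Return (full, factorized) parameter counts for an embedding layer.

    full:       a single vocab_size x hidden_dim matrix
    factorized: vocab_size x low_dim coordinates plus a shared
                low_dim x hidden_dim projection (two smaller matrices)
    """
    full = vocab_size * hidden_dim
    factorized = vocab_size * low_dim + low_dim * hidden_dim
    return full, factorized


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def init_new_coordinate(new_vec, source_vecs, source_coords, temperature=0.1):
    """Initialize the low-dimensional coordinate of an unseen subword.

    Assumption: the new subword and the source subwords all have vectors in
    one aligned static space. The new coordinate is a softmax-weighted
    average of the source subwords' coordinates, weighted by similarity.
    """
    sims = [cosine(new_vec, s) for s in source_vecs]
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(source_coords[0])
    return [sum(w * c[j] for w, c in zip(weights, source_coords))
            for j in range(dim)]


# Hypothetical sizes: 400,000 subwords, hidden size 768, low dimension 100.
full, factorized = embedding_param_counts(400_000, 768, 100)

# Toy example: the new subword's static vector matches the first source subword.
coord = init_new_coordinate(
    new_vec=[1.0, 0.0],
    source_vecs=[[1.0, 0.0], [0.0, 1.0]],
    source_coords=[[1.0, 0.0], [0.0, 1.0]],
)
```

With these hypothetical sizes, the factorized layout stores 40,076,800 embedding parameters instead of 307,200,000, and only the per-subword coordinates grow when the vocabulary is extended. In the toy example, the new coordinate lands almost exactly on the coordinate of the most similar source subword rather than at a random point.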

Yihong Liu, Peiqin Lin, Mingyang Wang, Hinrich Schütze • 2023

Related benchmarks

Task	Dataset	Result
Question Answering	ARC-E	Accuracy 38.17	544
Cross-lingual retrieval	WebFAQ	nDCG@10 59.6	32
Text Classification	SIB-200 kmb (test)	Weighted F1 63.2	10
Text Classification	SIB-200 umb (test)	Weighted F1 61.8	10
Text Classification	SIB-200 cjk (test)	Weighted F1 52.8	10
Text Classification	SIB-200 kon (test)	Weighted F1 76.9	10
Text Classification	SIB-200 lua (test)	Weighted F1 68.6	10
Question Answering	Knowledge-based Benchmarks German	ARC Score 28.74	8
Question Answering	Knowledge-based Benchmarks Arabic	ARC Score 26.52	8
Question Answering	Knowledge-based Benchmarks Vietnamese	ARC Score 26.58	8

Showing 10 of 12 rows
