Word Alignment by Fine-tuning Embeddings on Parallel Corpora

About

Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs. The great majority of past work on word alignment has worked by performing unsupervised learning on parallel texts. Recently, however, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data. In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models. We perform experiments on five language pairs and demonstrate that our model can consistently outperform previous state-of-the-art models of all varieties. In addition, we demonstrate that we are able to train multilingual word aligners that can obtain robust performance on different language pairs. Our aligner, AWESOME (Aligning Word Embedding Spaces of Multilingual Encoders), with pre-trained models is available at https://github.com/neulab/awesome-align

Zi-Yi Dou, Graham Neubig • 2021
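
The abstract mentions extracting alignments from contextualized embeddings but does not spell out the procedure. As a rough illustration only, the sketch below shows one common similarity-based approach: compute dot-product similarities between source and target word vectors, normalize them with a softmax in each direction, and keep the pairs whose probability exceeds a threshold in both directions. The function names, the toy vectors, and the threshold value are assumptions made for this example and are not taken from the paper or the awesome-align code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_alignments(src_vecs, tgt_vecs, threshold=0.5):
    """Return the set of (source_index, target_index) pairs to align.

    Hypothetical helper: src_vecs and tgt_vecs are lists of equal-length
    embedding vectors, one per word. The threshold is an arbitrary
    illustrative value, not a tuned setting.
    """
    n_src, n_tgt = len(src_vecs), len(tgt_vecs)

    # Similarity matrix: dot product of every source/target vector pair.
    sim = [[sum(a * b for a, b in zip(s, t)) for t in tgt_vecs]
           for s in src_vecs]

    # Source-to-target: normalize each row over the target positions.
    src_to_tgt = [softmax(row) for row in sim]

    # Target-to-source: normalize each column over the source positions.
    # tgt_to_src[j][i] is the probability of source i given target j.
    tgt_to_src = [softmax([sim[i][j] for i in range(n_src)])
                  for j in range(n_tgt)]

    # Keep only the pairs that both directions agree on.
    return {(i, j)
            for i in range(n_src)
            for j in range(n_tgt)
            if src_to_tgt[i][j] > threshold and tgt_to_src[j][i] > threshold}


# Toy usage: source word 0 resembles target word 1, and vice versa.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.0, 5.0], [5.0, 0.0]]
print(sorted(extract_alignments(source, target)))
```

Requiring agreement in both directions trades recall for precision; lowering the threshold admits more candidate links.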

Related benchmarks

Task	Dataset	Result
Named Entity Recognition	WikiAnn (test)	Average Accuracy: 66.4	58
Label Projection	XQuAD	COMET Score: 84.3	48
Label Projection	MLQA	Average COMET Score: 84.8	32
Named Entity Recognition	MasakhaNER 2.0	--	11
Named Entity Recognition	NER (Named Entity Recognition) annotation projection dataset (test)	ES Score: 87.3	7
Opinion Target Extraction	OTE (Opinion Target Extraction) annotation projection (test)	OTE ES Score: 91.5	7
Annotation Projection	Multi-task sequence labeling dataset (combined)	Average Score: 78	7
Argument Mining	Argument Mining (AM) annotation projection dataset (test)	ES: 0.548	7
Named Entity Recognition	WikiANN Turkmen tk (test)	F1 Score: 60.7	6
Named Entity Recognition	WikiANN Māori mi (test)	F1 Score: 46.1	6

Showing 10 of 12 rows
