SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

About

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners, even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.
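The abstract's key idea, aligning words by comparing multilingual embeddings, can be illustrated with a minimal sketch. It assumes each token already has a vector in a shared multilingual space and links two tokens when each is the other's nearest neighbour by cosine similarity. The function names, the toy vectors, and the mutual-argmax rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cosine_similarity_matrix(src_vecs, tgt_vecs):
    """Return a (num_src, num_tgt) matrix of cosine similarities."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return src @ tgt.T


def mutual_argmax_alignment(sim):
    """Link (i, j) when j is the best target for i AND i is the best source for j."""
    best_tgt = sim.argmax(axis=1)  # best target index for every source token
    best_src = sim.argmax(axis=0)  # best source index for every target token
    return [(i, int(j)) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy 2-dimensional vectors standing in for real multilingual embeddings (assumption).
src_tokens = ["the", "house"]
tgt_tokens = ["das", "Haus"]
src_vecs = np.array([[1.0, 0.1], [0.1, 1.0]])
tgt_vecs = np.array([[0.9, 0.2], [0.2, 0.8]])

alignment = mutual_argmax_alignment(cosine_similarity_matrix(src_vecs, tgt_vecs))
for i, j in alignment:
    print(src_tokens[i], "->", tgt_tokens[j])
```

No parallel data is needed at alignment time in this sketch: the only inputs are the two sentences and their embeddings.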

Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich Schütze • 2020

Related benchmarks

Task	Dataset	Result
Word Alignment	English-French (test)	AER 5	37
Word Alignment	Romanian-English (Ro-En) (test)	AER 23	34
Word Alignment	RWTH Gold Alignment de-en (test)	AER 0.17	31
Word Alignment	English-Hindi en-hi (test)	AER 44	17
Word Alignment	EuroParl en-de, en-fr, en-hi, en-ro WPT2003, WPT2005	AER (en-de) 18.93	12
Word Alignment	Arxiv Sure Only (S) (test)	F1 Score 91.7	7
Word Alignment	Arxiv Sure and Possible (S + P) (test)	F1 Score 91.9	7
Word Alignment	Wiki Sure Only (S) (test)	F1 Score 94.8	7
Annotation Projection	Multi-task sequence labeling dataset (combined)	Average Score 85.3	7
Word Alignment	en-hi WPT05 (test)	AER 30	7

Showing 10 of 24 rows
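
The table reports AER and F1 against gold alignments with sure (S) and possible (P) links. The page does not define these metrics, so the sketch below assumes the standard definitions: precision is measured against S ∪ P, recall against S, and AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The function name and the toy links are hypothetical.

```python
def alignment_scores(predicted, sure, possible=None):
    """Compute precision, recall, F1 and AER for one set of predicted links.

    Links are (source_index, target_index) pairs. Possible links are
    treated as a superset of sure links, per the assumed standard definition.
    """
    a = set(predicted)
    s = set(sure)
    p = s | set(possible or ())

    precision = len(a & p) / len(a) if a else 0.0
    recall = len(a & s) / len(s) if s else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s)) if (a or s) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "aer": aer}


# Toy example: two correct links and one spurious link (illustrative data only).
scores = alignment_scores(
    predicted={(0, 0), (1, 1), (2, 3)},
    sure={(0, 0), (1, 1)},
    possible={(2, 2)},
)
print(scores)
```

Lower is better for AER and higher is better for F1, which is worth keeping in mind when comparing rows that use different metrics.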
