# Cross-lingual Visual Pre-training for Multimodal Machine Translation

## About
Pre-trained language models have been shown to substantially improve performance on many natural language tasks. Although the early focus of such models was single-language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually grounded cross-lingual representations. Specifically, we extend translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training on three-way parallel vision & language corpora. We show that, when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.
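To make the objective concrete, below is a minimal PyTorch-style sketch of the combined pre-training loss: predicting masked tokens over a concatenated source-target sentence pair (translation language modelling) plus predicting the object class of masked image regions (masked region classification). All module names, dimensions, the 2048-d region features, and the generic Transformer encoder are illustrative assumptions, not the authors' exact architecture; positional embeddings and the actual masking procedure are omitted for brevity.

```python
# Sketch of TLM + masked region classification; names and sizes are assumptions.
import torch
import torch.nn as nn

VOCAB, N_CLASSES, D = 10000, 1601, 512  # assumed vocab / object-class / model sizes

class VTLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D)
        self.img_proj = nn.Linear(2048, D)        # project region features to model dim
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.mlm_head = nn.Linear(D, VOCAB)       # recovers masked tokens (TLM)
        self.mrc_head = nn.Linear(D, N_CLASSES)   # classifies masked regions (MRC)

    def forward(self, tokens, regions):
        # tokens:  (B, T) ids of the [source ; target] concatenation, some masked
        # regions: (B, R, 2048) pooled visual features, some zeroed out ("masked")
        x = torch.cat([self.tok_emb(tokens), self.img_proj(regions)], dim=1)
        h = self.encoder(x)                       # joint text + vision encoding
        T = tokens.size(1)
        return self.mlm_head(h[:, :T]), self.mrc_head(h[:, T:])

model = VTLMSketch()
tokens = torch.randint(0, VOCAB, (2, 20))         # placeholder source+target ids
regions = torch.randn(2, 36, 2048)                # e.g. 36 detected regions per image
tok_logits, reg_logits = model(tokens, regions)

# Cross-entropy on masked positions only; shown over all positions for brevity,
# with random region labels standing in for detector-assigned object classes.
tlm_loss = nn.functional.cross_entropy(tok_logits.reshape(-1, VOCAB), tokens.reshape(-1))
mrc_loss = nn.functional.cross_entropy(reg_logits.reshape(-1, N_CLASSES),
                                       torch.randint(0, N_CLASSES, (2 * 36,)))
loss = tlm_loss + mrc_loss
```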
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Machine Translation | Multi30K En-Fr v1 2017 (test) | -- | -- | 30 |
| Machine Translation | Multi30K En-De (test) | METEOR | 47.7 | 26 |
| Machine Translation | Multi30K En-Fr 2016 (test) | METEOR | 75.9 | 18 |
| Machine Translation | CoMMuTE En-Fr | Accuracy | 50.1 | 8 |
| Machine Translation | Multi30K En-Fr 2016 (test) | BLEU | 61.4 | 8 |
| Machine Translation | Multi30K En-Fr 2017 (test) | BLEU | 53.6 | 8 |
| Machine Translation | Multi30K En-De 2016 (test) | BLEU | 39.4 | 8 |
| Machine Translation | MSCOCO En-De | BLEU | 28.2 | 8 |
| Machine Translation | Multi30K En-De 2016 (test) | METEOR | 55.4 | 8 |
| Machine Translation | MSCOCO En-Fr | BLEU | 43.4 | 8 |
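The BLEU figures above are corpus-level scores as reported on the benchmark leaderboards; the snippet below is a minimal sketch of how such scores are commonly computed with the sacrebleu library. The hypothesis and reference sentences are placeholders, not Multi30K data.

```python
# Hedged sketch: corpus-level BLEU with sacrebleu; example strings are made up.
import sacrebleu

hypotheses = ["a man is riding a bicycle .", "two dogs play in the snow ."]
# One reference stream: one reference per hypothesis, in order.
references = [["a man rides a bicycle .", "two dogs are playing in the snow ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # comparable in scale to the table's BLEU column
```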