UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
About
Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to the multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging the MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves new state-of-the-art results on diverse non-English benchmarks while maintaining performance comparable to monolingual pre-trained models on English tasks.
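The masking strategy behind an objective like VTLM can be sketched as follows. This is an illustrative toy implementation, not the paper's code: it masks tokens independently in an English caption and its machine-translated counterpart and builds MLM-style labels, so that a model consuming the concatenated sequence (plus image features, omitted here) would have to recover masked tokens from the image and the other language's context. The token lists, mask rate, and `-100` ignore-index convention are assumptions for the example.

```python
import random

MASK = "[MASK]"
IGNORE = -100  # common ignore-index convention for MLM losses; an assumption here


def vtlm_mask(en_tokens, xx_tokens, p=0.15, seed=1):
    """Toy VTLM-style masking over a caption pair in two languages.

    Each token is masked independently with probability ``p``; the label is
    the original token at masked positions and ``IGNORE`` elsewhere.
    """
    rng = random.Random(seed)

    def mask_one(tokens):
        inputs, labels = [], []
        for tok in tokens:
            if rng.random() < p:
                inputs.append(MASK)
                labels.append(tok)     # model must predict the original token
            else:
                inputs.append(tok)
                labels.append(IGNORE)  # position is not scored by the loss
        return inputs, labels

    en_in, en_lab = mask_one(en_tokens)
    xx_in, xx_lab = mask_one(xx_tokens)
    # Concatenate both captions into one input sequence; in the real model,
    # image region features would also be part of the joint input.
    return en_in + xx_in, en_lab + xx_lab


inputs, labels = vtlm_mask(
    ["a", "dog", "runs", "on", "grass"],          # English caption
    ["ein", "hund", "läuft", "auf", "gras"],      # hypothetical MT output (German)
)
```

Because masking is applied independently to each language, a token masked in one caption usually survives in the other, which is what lets the shared visual context and the parallel caption act as a pivot for cross-lingual alignment.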
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy: 71.48 | 337 |
| Image-to-Text Retrieval | COCO-CN | -- | 48 |
| Multimodal Retrieval | Multi30K (test) | Recall (EN): 88.2 | 35 |
| Image-Text Retrieval | MSCOCO (test) | EN Retrieval Score: 88.1 | 28 |
| Image-Text Retrieval | Flickr30k (test) | -- | 21 |
| Cross-modal Retrieval | MSCOCO 1K | Mean Recall (ja): 87.5 | 16 |
| Cross-lingual Vision-Language Understanding and Retrieval | IGLUE 1.0 (test) | XVNLI Accuracy: 63.68 | 16 |
| Text-Image Retrieval | Flickr&CO (test) | Retrieval Score (DE): 28.6 | 14 |
| Image Retrieval | xFlickr&CO (test) | Recall@1: 20.31 | 7 |
| Visual Entailment | XVNLI (test) | Accuracy: 62.05 | 7 |