UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

About

Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging the MT-augmented data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves a new state of the art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
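The first two steps described in the abstract (MT augmentation of English captions, then masked language modeling over a caption in any language paired with the same image) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the stub translator `fake_translate`, whitespace tokenization, uniform language sampling, and the simple per-token masking rule are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List

MASK = "[MASK]"  # placeholder mask symbol (assumption, not taken from the paper)


def augment_captions(
    english_caption: str,
    target_langs: List[str],
    translate: Callable[[str, str], str],
) -> Dict[str, str]:
    """Return one caption per language code, starting from the English caption."""
    captions = {"en": english_caption}
    for lang in target_langs:
        captions[lang] = translate(english_caption, lang)
    return captions


def mask_tokens(tokens: List[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def make_mlm_example(
    image_id: str,
    captions: Dict[str, str],
    mask_prob: float,
    rng: random.Random,
) -> dict:
    """Pick one language at random and build a masked caption for that image.

    The image is shared across all languages, so it acts as the pivot.
    """
    lang = rng.choice(sorted(captions))
    tokens = captions[lang].split()  # whitespace tokenization (assumption)
    return {
        "image_id": image_id,
        "lang": lang,
        "input_tokens": mask_tokens(tokens, mask_prob, rng),
        "target_tokens": tokens,
    }


def fake_translate(text: str, lang: str) -> str:
    """Hypothetical stand-in for a real MT system; it only tags the text."""
    return f"<{lang}> {text}"


if __name__ == "__main__":
    rng = random.Random(0)
    caps = augment_captions("a dog runs on the beach", ["de", "ja"], fake_translate)
    print(make_mlm_example("img_001", caps, mask_prob=0.15, rng=rng))
```

In a real pipeline `fake_translate` would be replaced by an actual MT model, and the masked caption would be fed to the model together with the region features of the image.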

Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy 71.48 | 337 |
| Image-to-Text Retrieval | COCO-CN | -- | 48 |
| Multimodal Retrieval | Multi30K (test) | Recall (EN) 88.2 | 35 |
| Image-Text Retrieval | MSCOCO (test) | EN Retrieval Score 88.1 | 28 |
| Image-Text Retrieval | Flickr30k (test) | -- | 21 |
| Cross-modal retrieval | MSCOCO 1K | Mean Recall (ja) 87.5 | 16 |
| Cross-lingual Vision-Language Understanding and Retrieval | IGLUE 1.0 (test) | XVNLI Accuracy 63.68 | 16 |
| Text-Image Retrieval | Flickr&CO (test) | Retrieval Score (DE) 28.6 | 14 |
| Image Retrieval | xFlickr&CO (test) | Recall@1 20.31 | 7 |
| Visual Entailment | XVNLI (test) | Accuracy 62.05 | 7 |
Showing 10 of 15 rows
