UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
About
Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to the multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging the MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves new state-of-the-art results on diverse non-English benchmarks while maintaining performance comparable to monolingual pre-trained models on English tasks.
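The masking strategy behind an objective like VTLM can be sketched as follows. This is an illustrative toy implementation, not the paper's code: it masks tokens independently in an English caption and its machine-translated counterpart and builds MLM-style labels, so that a model consuming the concatenated sequence (plus image features, omitted here) would have to recover masked tokens from the image and the other language's context. The token lists, mask rate, and `-100` ignore-index convention are assumptions for the example.

```python
import random

MASK = "[MASK]"
IGNORE = -100  # common ignore-index convention for MLM losses; an assumption here


def vtlm_mask(en_tokens, xx_tokens, p=0.15, seed=1):
    """Toy VTLM-style masking over a caption pair in two languages.

    Each token is masked independently with probability ``p``; the label is
    the original token at masked positions and ``IGNORE`` elsewhere.
    """
    rng = random.Random(seed)

    def mask_one(tokens):
        inputs, labels = [], []
        for tok in tokens:
            if rng.random() < p:
                inputs.append(MASK)
                labels.append(tok)     # model must predict the original token
            else:
                inputs.append(tok)
                labels.append(IGNORE)  # position is not scored by the loss
        return inputs, labels

    en_in, en_lab = mask_one(en_tokens)
    xx_in, xx_lab = mask_one(xx_tokens)
    # Concatenate both captions into one input sequence; in the real model,
    # image region features would also be part of the joint input.
    return en_in + xx_in, en_lab + xx_lab


inputs, labels = vtlm_mask(
    ["a", "dog", "runs", "on", "grass"],          # English caption
    ["ein", "hund", "läuft", "auf", "gras"],      # hypothetical MT output (German)
)
```

Because masking is applied independently to each language, a token masked in one caption usually survives in the other, which is what lets the shared visual context and the parallel caption act as a pivot for cross-lingual alignment.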
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy: 71.48 | 337 |
| Image-to-Text Retrieval | COCO-CN | -- | 48 |
| Multimodal Retrieval | Multi30K (test) | Recall (EN): 88.2 | 35 |
| Image-Text Retrieval | MSCOCO (test) | EN Retrieval Score: 88.1 | 28 |
| Image-Text Retrieval | Flickr30k (test) | -- | 21 |
| Cross-modal Retrieval | MSCOCO 1K | Mean Recall (ja): 87.5 | 16 |
| Cross-lingual Vision-Language Understanding and Retrieval | IGLUE 1.0 (test) | XVNLI Accuracy: 63.68 | 16 |
| Text-Image Retrieval | Flickr&CO (test) | Retrieval Score (DE): 28.6 | 14 |
| Image Retrieval | xFlickr&CO (test) | Recall@1: 20.31 | 7 |
| Visual Entailment | XVNLI (test) | Accuracy: 62.05 | 7 |