M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
About
We present M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training. Our goal is to learn universal representations that map objects occurring in different modalities, or text expressed in different languages, into a common semantic space. In addition, to explicitly encourage fine-grained alignment between images and non-English languages, we propose Multimodal Code-switched Training (MCT), which combines monolingual pre-training and multimodal pre-training via a code-switch strategy. Experiments are performed on the multilingual image retrieval task across two benchmark datasets, MSCOCO and Multi30K. M3P achieves results comparable to the state of the art for English and new state-of-the-art results for non-English languages.
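To make the code-switch strategy behind MCT concrete, below is a minimal sketch of word-level code-switching on an image caption. Everything here is illustrative: the `code_switch` function, the tiny English-German dictionary, and the 50% switch probability are assumptions for the example, not the actual M3P implementation (which builds its code-switched captions from bilingual dictionaries during pre-training).

```python
import random

# Illustrative bilingual dictionary (word-level translations). In practice this
# would be loaded from a large bilingual lexicon rather than hard-coded.
EN_DE_DICT = {
    "dog": "Hund",
    "ball": "Ball",
    "park": "Park",
}

def code_switch(tokens, dictionary, switch_prob=0.5):
    """Replace each token with its translation with probability `switch_prob`.

    The code-switched caption stays paired with the original image, so the
    model is pushed to align image content with non-English words as well.
    """
    switched = []
    for tok in tokens:
        if tok.lower() in dictionary and random.random() < switch_prob:
            switched.append(dictionary[tok.lower()])
        else:
            switched.append(tok)
    return switched

caption = "A dog runs after a ball in the park".split()
print(" ".join(code_switch(caption, EN_DE_DICT)))
# e.g. "A Hund runs after a Ball in the park"
```

The code-switched caption and the original image then form a positive pair for the usual multimodal pre-training objectives.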
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Retrieval | COCO-CN | -- | 49 |
| Image-to-Text Retrieval | COCO-CN | -- | 48 |
| Multimodal Retrieval | Multi30K (test) | Recall (EN): 87.7 | 35 |
| Image-Text Retrieval | MSCOCO (test) | EN Retrieval Score: 88.7 | 28 |
| Image-Text Retrieval | Flickr30k (test) | -- | 21 |
| Cross-modal Retrieval | MSCOCO 1K | Mean Recall (JA): 87.9 | 16 |
| Cross-lingual Vision-Language Understanding and Retrieval | IGLUE 1.0 (test) | XVNLI Accuracy: 59.36 | 16 |
| Text-Image Retrieval | Flickr&CO (test) | Retrieval Score (DE): 13.35 | 14 |
| Visual Reasoning | MaRVL (test) | Accuracy: 56 | 7 |
| Visual Reasoning | MaRVL | ID: 56.47 | 7 |