# VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

## About
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. The code and pretrained models are available at https://aka.ms/vlmo.
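The key structural idea above is that every MoME block shares one self-attention layer across modalities but routes tokens to a modality-specific feed-forward expert (the paper uses vision, language, and vision-language experts). A minimal NumPy sketch of that routing, with hypothetical names (`MoMEBlock`, the `experts` dict) and random weights standing in for trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoMEBlock:
    """Sketch of one Mixture-of-Modality-Experts Transformer block:
    a single shared self-attention layer, followed by one feed-forward
    expert per modality ("vision", "language", "vl"). Hypothetical,
    untrained weights; layer norm and multi-head splitting are omitted."""

    def __init__(self, d, rng):
        # shared attention projections (used for every modality)
        self.wq, self.wk, self.wv = (
            rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)
        )
        # one two-layer FFN expert per modality
        self.experts = {
            m: (rng.standard_normal((d, 4 * d)) / np.sqrt(d),
                rng.standard_normal((4 * d, d)) / np.sqrt(4 * d))
            for m in ("vision", "language", "vl")
        }

    def __call__(self, x, modality):
        # shared self-attention with a residual connection
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v
        h = x + attn
        # route all tokens to the expert for the current modality
        w1, w2 = self.experts[modality]
        return h + np.maximum(h @ w1, 0.0) @ w2  # ReLU FFN, residual
```

Because only the FFN expert changes per modality, the same pretrained block can serve as part of a dual encoder (image and text passed separately through their own experts) or as a fusion encoder (joint tokens through the vision-language expert), which is what makes the fine-tuning flexibility described above possible.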
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 53.4 | 2888 |
| Image classification | ImageNet-1K (val) | Top-1 accuracy | 85.5 | 844 |
| Visual question answering | VQA v2 (test-dev) | Overall accuracy | 82.8 | 706 |
| Image-to-text retrieval | Flickr30K 1K (test) | R@1 | 95.3 | 491 |
| Visual question answering | VQA v2 (test-std) | Accuracy | 80 | 486 |
| Text-to-image retrieval | Flickr30K 1K (test) | R@1 | 84.5 | 432 |
| Natural language visual reasoning | NLVR2 (test-p) | Accuracy | 89.54 | 346 |
| Visual question answering | VQA v2 (test-dev) | Accuracy | 82.88 | 337 |
| Image-to-text retrieval | MS-COCO 5K (test) | R@1 | 78.2 | 320 |
| Text-to-image retrieval | MS-COCO 5K (test) | R@1 | 60.6 | 308 |