Unified Vision-Language Pre-Training for Image Captioning and VQA

About

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
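The two pre-training objectives differ only in the self-attention mask, which can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_attention_mask`, the simplified input layout (image regions followed by caption words, with no special tokens), and the 0/1 encoding are assumptions made for illustration. In the bidirectional setting every position may attend to every other position; in the seq2seq setting a word may attend only to the image regions and to words up to and including itself, and regions do not attend to words.

```python
def build_attention_mask(num_regions, num_words, mode):
    """Return an n x n list of lists; mask[i][j] == 1 means position i may attend to j.

    Assumed layout (hypothetical, for illustration): the first `num_regions`
    positions are image regions, the remaining `num_words` are caption words.
    """
    if mode not in ("bidirectional", "seq2seq"):
        raise ValueError("mode must be 'bidirectional' or 'seq2seq'")

    n = num_regions + num_words
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if mode == "bidirectional":
                # Full context: everything sees everything.
                allowed = True
            elif j < num_regions:
                # seq2seq: image regions are visible to every position.
                allowed = True
            else:
                # seq2seq: word j is visible only to words at or after it;
                # image regions never attend to words.
                allowed = i >= num_regions and j <= i
            mask[i][j] = 1 if allowed else 0
    return mask


if __name__ == "__main__":
    for mode in ("bidirectional", "seq2seq"):
        print(mode)
        for row in build_attention_mask(2, 3, mode):
            print(" ".join(str(v) for v in row))
```

In a transformer, a mask like this would typically be turned into large negative values added to the attention logits at the zero positions, so the same shared network can serve both objectives without any change to its weights.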

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao • 2019

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VQA v2	Accuracy 24.3	1429
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy 70.5	712
Image Captioning	MS COCO Karpathy (test)	CIDEr 1.293	706
Visual Question Answering	VQA v2 (test-std)	Accuracy 70.7	486
Visual Question Answering	VQA 2.0 (test-dev)	Accuracy 70.5	337
Image Retrieval	MS-COCO 5K (test)	R@1 25.3	217
Visual Question Answering	VQAv2	Accuracy 0.00e+0	196
Text Retrieval	MS-COCO 5K (test)	R@1 41.2	182
Visual Question Answering	VQA (test-dev)	--	147
Image Retrieval	MS-COCO 1K (test)	R@1 47.1	128

Showing 10 of 26 rows
