Unified Vision-Language Pre-Training for Image Captioning and VQA

About

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
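The abstract states that the bidirectional and seq2seq objectives differ only in the context each prediction may condition on, and that this is set by the self-attention mask. A minimal sketch of that idea follows. It assumes a UniLM-style layout in which image regions come first and caption words second, and it ignores special tokens; the function name `build_attention_mask` and its parameters are hypothetical and are not taken from the VLP repository.

```python
def build_attention_mask(num_regions, num_words, mode):
    """Return an n x n matrix of 0/1 flags (illustrative helper, not the VLP API).

    mask[i][j] == 1 means position i may attend to position j.
    Positions 0..num_regions-1 are image regions; the rest are caption words.
    """
    if mode not in ("bidirectional", "seq2seq"):
        raise ValueError("mode must be 'bidirectional' or 'seq2seq'")

    n = num_regions + num_words
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if mode == "bidirectional":
                # Every position sees the full image-text sequence.
                allowed = True
            elif i < num_regions:
                # Assumed seq2seq rule: image regions see only image regions.
                allowed = j < num_regions
            else:
                # Assumed seq2seq rule: a word sees all regions plus the
                # words up to and including its own position.
                allowed = j <= i
            mask[i][j] = 1 if allowed else 0
    return mask


if __name__ == "__main__":
    for row in build_attention_mask(2, 3, "seq2seq"):
        print(row)
```

With 2 regions and 3 words, the seq2seq mask is block-structured: the first two rows see only the two region columns, and each word row extends one column further to the right than the previous one. In a transformer, positions flagged 0 would typically have their attention logits set to a large negative value before the softmax.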

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao • 2019

Related benchmarks

Task                      | Dataset                 | Result                | Rank
--------------------------|-------------------------|-----------------------|-----
Visual Question Answering | VQA v2                  | Accuracy 24.3         | 1165
Image Captioning          | MS COCO Karpathy (test) | CIDEr 1.293           | 682
Visual Question Answering | VQA v2 (test-dev)       | Overall Accuracy 70.5 | 664
Visual Question Answering | VQA v2 (test-std)       | Accuracy 70.7         | 466
Visual Question Answering | VQA 2.0 (test-dev)      | Accuracy 70.5         | 337
Image Retrieval           | MS-COCO 5K (test)       | R@1 25.3              | 217
Text Retrieval            | MS-COCO 5K (test)       | R@1 41.2              | 182
Visual Question Answering | VQAv2                   | Accuracy 0.00e+0      | 177
Visual Question Answering | VQA (test-dev)          | --                    | 147
Image Retrieval           | MS-COCO 1K (test)       | R@1 47.1              | 128
Showing 10 of 26 rows
