VL-BEiT: Generative Vision-Language Pretraining
About
We introduce a vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining. Our minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. VL-BEiT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training. Our method is conceptually simple and empirically effective. Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification and semantic segmentation.
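The unifying idea above is that text tokens, visual tokens, and mixed image-text sequences all go through the same corrupt-and-reconstruct objective. The sketch below illustrates that masking step only; it is a hypothetical illustration (the function `mask_tokens` and the mask ratio are assumptions, not taken from the VL-BEiT codebase), and the same routine is applied whether the sequence holds word tokens or discrete visual-patch tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio, rng):
    """Randomly replace a fraction of tokens with [MASK].

    Returns the corrupted sequence and the (position, original token)
    targets the model must reconstruct -- the single pretraining task
    shared across text, image, and image-text data.
    """
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = []
    for pos in sorted(positions):
        targets.append((pos, corrupted[pos]))
        corrupted[pos] = MASK
    return corrupted, targets

rng = random.Random(0)

# Masked language modeling: corrupt a text token sequence.
text = ["a", "dog", "runs", "on", "grass"]
text_corrupted, text_targets = mask_tokens(text, 0.4, rng)

# Masked image modeling: the same routine over discrete visual-token ids
# (patch tokens), illustrating the "one unified pretraining task" claim.
patches = [17, 3, 88, 42, 5, 61]
img_corrupted, img_targets = mask_tokens(patches, 0.4, rng)
```

In the actual model, a shared bidirectional Transformer consumes the corrupted sequence and is trained to predict the original tokens at the masked positions.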
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 77.5 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 77.8 | 466 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 95.8 | 439 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 83.9 | 375 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 82.7 | 327 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy | 81.9 | 288 |
| Text-to-Image Retrieval | MS-COCO 5K (test) | R@1 | 61.5 | 286 |
| Image Retrieval | MS-COCO 5K (test) | R@1 | 61.5 | 217 |
| Text Retrieval | MS-COCO 5K (test) | R@1 | 79.5 | 182 |
| Text Retrieval | Flickr30K 1K (test) | R@1 | 95.8 | 82 |