
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

About

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
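The Multiway Transformer idea above can be sketched in code: self-attention is shared across modalities, while each token is routed through a feed-forward "expert" for its own modality. The sketch below is a minimal illustration in PyTorch, not BEiT-3's actual implementation; the class name, the two-expert setup, and the `modality_ids` routing scheme are assumptions for clarity.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Sketch of one Multiway Transformer layer: a shared self-attention
    module, followed by modality-specific feed-forward experts.
    (Illustrative only; hyperparameters and names are not BEiT-3's.)"""

    def __init__(self, dim=64, heads=4, ffn_mult=4, num_modalities=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN expert per modality, e.g. 0 = vision, 1 = language.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, dim * ffn_mult),
                nn.GELU(),
                nn.Linear(dim * ffn_mult, dim),
            )
            for _ in range(num_modalities)
        ])

    def forward(self, x, modality_ids):
        # Shared self-attention over the full (possibly mixed) sequence,
        # which is what enables deep fusion of image and text tokens.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Modality-specific encoding: each token is processed by the
        # expert matching its modality id.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out

# Usage: one sequence of 2 image tokens followed by 3 text tokens.
block = MultiwayBlock()
x = torch.randn(1, 5, 64)
modality_ids = torch.tensor([[0, 0, 1, 1, 1]])
y = block(x, modality_ids)
print(y.shape)  # torch.Size([1, 5, 64])
```

For image-only or text-only inputs, all tokens simply take the same expert branch, so the one backbone serves both modality-specific encoding and cross-modal fusion.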

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 62.8 | 2731 |
| Object detection | COCO 2017 (val) | AP | 63.7 | 2454 |
| Image classification | ImageNet-1k (val) | Top-1 Accuracy | 89.6 | 1453 |
| Object detection | COCO (test-dev) | mAP | 63.7 | 1195 |
| Semantic segmentation | ADE20K | mIoU | 62.8 | 936 |
| Image classification | ImageNet-1k (test) | Top-1 Accuracy | 89.6 | 798 |
| Image captioning | MS COCO Karpathy (test) | CIDEr | 147.6 | 682 |
| Visual question answering | VQA v2 (test-dev) | Overall Accuracy | 84.2 | 664 |
| Image classification | ImageNet-1k (val) | Top-1 Accuracy | 89.6 | 512 |
| Object detection | COCO 2017 (test-dev) | mAP | 63.7 | 499 |

(Showing 10 of 57 rows.)
