Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

About

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei• 2022

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU62.8	3069
Object Detection	COCO 2017 (val)	AP63.7	2843
Image Classification	ImageNet-1k (val)	Top-1 Accuracy89.6	1498
Object Detection	COCO (test-dev)	mAP63.7	1239
Semantic segmentation	ADE20K	mIoU62.8	1028
Image Classification	ImageNet 1k (test)	Top-1 Accuracy89.6	880
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy84.2	712
Image Classification	ImageNet-1k (val)	Top-1 Accuracy89.6	708
Image Captioning	MS COCO Karpathy (test)	CIDEr147.6	706
Text-to-Image Retrieval	Flickr30K	R@181.5	559

Showing 10 of 62 rows

Other info

Code

Follow for update

@wizwand_team Discord