An Empirical Study of Training End-to-End Vision-and-Language Transformers

About

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng• 2021

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy80.33	712
Image Captioning	MS COCO Karpathy (test)	CIDEr128.2	706
Text-to-Image Retrieval	Flickr30k (test)	Recall@182.22	525
Image-to-Text Retrieval	Flickr30K 1K (test)	R@182.2	491
Image Classification	DTD	Accuracy62.2	487
Visual Question Answering	VQA v2 (test-std)	Accuracy80.54	486
Image-to-Text Retrieval	Flickr30k (test)	R@194.3	472
Image Classification	Food101	Accuracy79.2	457
Image Classification	CIFAR100	Accuracy70.3	347
Natural Language Visual Reasoning	NLVR2 (test-p)	Accuracy83.47	346

Showing 10 of 66 rows

Other info

Code

Follow for update

@wizwand_team Discord