Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

About

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data, which is costly to collect compared to image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn cross-modal representations from non-parallel image and text datasets. We found two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks, including region-to-tag, region-to-phrase, and image-to-sentence alignment, to bridge the gap between the two modalities. A comprehensive ablation study shows that each granularity helps to learn a stronger pre-trained model. We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+. Our model achieves state-of-the-art performance on all these tasks under the unsupervised setting.

Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang • 2022
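
The abstract's first step, building a weakly aligned image-text corpus by retrieval, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how retrieval is scored, so the example assumes each image comes with a list of detected object tags and ranks sentences by simple token overlap (Jaccard similarity). The function names `tokenize` and `retrieve_sentences` and the sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {tok.strip(".,!?;:").lower() for tok in text.split()} - {""}


def retrieve_sentences(
    image_tags: Dict[str, List[str]],
    sentences: List[str],
    top_k: int = 1,
) -> Dict[str, List[Tuple[str, float]]]:
    """Pair each image with the top_k sentences that best overlap its tags.

    Assumption: Jaccard overlap between tags and sentence tokens stands in
    for whatever retrieval score the paper actually uses.
    """
    sentence_tokens = [tokenize(s) for s in sentences]
    pairs: Dict[str, List[Tuple[str, float]]] = {}
    for image_id, tags in image_tags.items():
        tag_set = {t.lower() for t in tags}
        scored = []
        for sent, toks in zip(sentences, sentence_tokens):
            union = tag_set | toks
            score = len(tag_set & toks) / len(union) if union else 0.0
            scored.append((sent, score))
        # Highest overlap first; the sort is stable, so ties keep corpus order.
        scored.sort(key=lambda p: p[1], reverse=True)
        pairs[image_id] = scored[:top_k]
    return pairs


# Hypothetical non-parallel data: tags from an object detector, plus a
# text-only corpus collected independently of the images.
image_tags = {
    "img_001": ["dog", "frisbee", "grass"],
    "img_002": ["pizza", "table"],
}
sentences = [
    "A dog catches a frisbee on the grass.",
    "Two people share a pizza at a table.",
    "A train arrives at the station.",
]

weak_pairs = retrieve_sentences(image_tags, sentences)
for image_id, matches in weak_pairs.items():
    print(image_id, matches)
```

The resulting pairs are only weakly aligned: a retrieved sentence mentions some of the image's objects but was never written to describe that image. That is the kind of input the multi-granular alignment tasks in the abstract are meant to work with.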

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy 72.1	712
Natural Language Visual Reasoning	NLVR2 (test-p)	Accuracy 73.4	346
Visual Question Answering	VQA 2.0 (test-dev)	Accuracy 72.1	337
Referring Expression Comprehension	RefCOCO+ (testA)	Accuracy 85.5	216
Visual Entailment	SNLI-VE (test)	Overall Accuracy 77.3	199
Referring Expression Comprehension	RefCOCO+ (test-B)	Accuracy 73.7	167
Referring Expression Comprehension	RefCOCO+ (dev)	Accuracy 80.3	9

