Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
About
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected under overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the scale and diversity of the resulting data. We push the limits of vision-and-language pre-training data further by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M (CC12M), a dataset of 12 million image-text pairs intended specifically for vision-and-language pre-training. We analyze this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
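Like CC3M, CC12M is released as a tab-separated file of image URL / caption pairs rather than as the images themselves, so consumers fetch the images from the listed URLs. The sketch below shows a minimal parser for that format; the two-column `(url, caption)` layout is an assumption based on the official release, and the sample rows are invented for illustration.

```python
import csv
import io

def load_pairs(tsv_text):
    """Parse CC12M-style TSV text into (image_url, caption) tuples.

    Assumes two tab-separated columns per line, URL first; malformed
    rows are skipped rather than raising, since a 12M-row web-scraped
    file will inevitably contain a few bad lines.
    """
    pairs = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) == 2 and row[0].startswith("http"):
            pairs.append((row[0], row[1]))
    return pairs

# Invented sample rows, for illustration only.
sample = (
    "http://example.com/a.jpg\ta dog running on a beach\n"
    "not-a-url\tbroken row that should be skipped\n"
    "http://example.com/b.jpg\ta red bicycle against a wall\n"
)
pairs = load_pairs(sample)
print(len(pairs))          # → 2
print(pairs[0][1])         # → a dog running on a beach
```

In practice a downloader would stream the real `cc12m.tsv` from disk and fetch each URL with retries, since a noticeable fraction of web-scraped links go stale over time.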
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 110.9 | 682 |
| Image Captioning | nocaps (val) | CIDEr (Overall) | 90.2 | 93 |
| Visual Question Answering | VQA-RAD | Closed Accuracy | 83.5 | 49 |
| Image Captioning | NoCaps | CIDEr (in-domain) | 92.6 | 36 |
| Visual Question Answering | Slake | Closed Accuracy | 87.8 | 27 |
| Image Captioning | nocaps XD (val) | CIDEr | 90.2 | 8 |
| Image Captioning | Conceptual Captions Google-CC 3M (dev) | CIDEr | 105.4 | 7 |
| Image Captioning | nocaps XD (test) | CIDEr | 87.3 | 5 |