
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

About

The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
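The dataset described above consists of image-text pairs released as image URLs with their associated alt-text captions. As a minimal sketch of consuming such a release, the following assumes a local tab-separated file (the filename `cc12m.tsv` and two-column URL/caption layout are assumptions for illustration, not the official loader):

```python
import csv

def load_cc12m_pairs(tsv_path, limit=None):
    """Yield (image_url, caption) pairs from a CC12M-style TSV file.

    Assumes each line holds an image URL and its caption separated by
    a tab; adjust the column layout to match your copy of the release.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            if len(row) < 2:
                continue  # skip malformed lines
            yield row[0], row[1]
```

At web scale, many of the referenced URLs rot over time, so a real pipeline would fetch images asynchronously and tolerate failures rather than assume every pair resolves.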

Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut • 2021

Related benchmarks

Task                      | Dataset                                 | Metric            | Result | Rank
Image Captioning          | MS COCO Karpathy (test)                 | CIDEr             | 110.9  | 682
Image Captioning          | nocaps (val)                            | CIDEr (Overall)   | 90.2   | 93
Visual Question Answering | VQA-RAD                                 | Closed Accuracy   | 83.5   | 49
Image Captioning          | NoCaps                                  | CIDEr (in-domain) | 92.6   | 36
Visual Question Answering | Slake                                   | Closed Accuracy   | 87.8   | 27
Image Captioning          | nocaps XD (val)                         | CIDEr             | 90.2   | 8
Image Captioning          | Conceptual Captions Google-CC 3M (dev)  | CIDEr             | 105.4  | 7
Image Captioning          | nocaps XD (test)                        | CIDEr             | 87.3   | 5

Other info

Code
