Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
About
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected under overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the scale and diversity of the resulting data. We push the limits of vision-and-language pre-training data further by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M (CC12M), a dataset of 12 million image-text pairs intended specifically for vision-and-language pre-training. We analyze this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
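Like CC3M, CC12M is released as a tab-separated file of image URL / caption pairs rather than as the images themselves, so consumers fetch the images from the listed URLs. The sketch below shows a minimal parser for that format; the two-column `(url, caption)` layout is an assumption based on the official release, and the sample rows are invented for illustration.

```python
import csv
import io

def load_pairs(tsv_text):
    """Parse CC12M-style TSV text into (image_url, caption) tuples.

    Assumes two tab-separated columns per line, URL first; malformed
    rows are skipped rather than raising, since a 12M-row web-scraped
    file will inevitably contain a few bad lines.
    """
    pairs = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) == 2 and row[0].startswith("http"):
            pairs.append((row[0], row[1]))
    return pairs

# Invented sample rows, for illustration only.
sample = (
    "http://example.com/a.jpg\ta dog running on a beach\n"
    "not-a-url\tbroken row that should be skipped\n"
    "http://example.com/b.jpg\ta red bicycle against a wall\n"
)
pairs = load_pairs(sample)
print(len(pairs))          # → 2
print(pairs[0][1])         # → a dog running on a beach
```

In practice a downloader would stream the real `cc12m.tsv` from disk and fetch each URL with retries, since a noticeable fraction of web-scraped links go stale over time.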
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 110.9 | 682 |
| Image Captioning | nocaps (val) | CIDEr (Overall) | 90.2 | 93 |
| Visual Question Answering | VQA-RAD | Closed Accuracy | 83.5 | 49 |
| Image Captioning | NoCaps | CIDEr (in-domain) | 92.6 | 36 |
| Visual Question Answering | Slake | Closed Accuracy | 87.8 | 27 |
| Image Captioning | nocaps XD (val) | CIDEr | 90.2 | 8 |
| Image Captioning | Conceptual Captions Google-CC 3M (dev) | CIDEr | 105.4 | 7 |
| Image Captioning | nocaps XD (test) | CIDEr | 87.3 | 5 |