OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

About

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh • 2023
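The abstract describes OBELICS documents as natural web pages in which images and text are interleaved in their original order. As a minimal sketch of that structure, one can think of each document as a pair of parallel lists, one of image references and one of text blocks, where exactly one entry is non-null at each position; this field layout (`images`/`texts` keys) is an assumption for illustration, not a documented API of the released dataset:

```python
def interleave(images, texts):
    """Merge parallel images/texts lists into one ordered document.

    Assumed layout (hypothetical): at each index, exactly one of the
    two lists holds a value and the other holds None, preserving the
    original image/text order of the web page.
    """
    doc = []
    for img, txt in zip(images, texts):
        if img is not None:
            doc.append(("image", img))
        if txt is not None:
            doc.append(("text", txt))
    return doc

# Toy two-element document: an image followed by a paragraph.
sample = {
    "images": ["https://example.com/fig1.jpg", None],
    "texts": [None, "The figure above shows ..."],
}
doc = interleave(sample["images"], sample["texts"])
```

Flattening a document this way yields the token stream a multimodal model would consume, with image placeholders appearing at their original positions in the text.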

Related benchmarks

Task                             Dataset             Result                  Rank
Visual Question Answering        VizWiz              Accuracy 48.3           1525
Object Hallucination Evaluation  POPE                Accuracy 74.6           1455
Visual Question Answering        VQA v2              Accuracy 65.4           1362
Visual Question Answering        TextVQA             Accuracy 39.3           1285
Visual Question Answering        GQA                 Accuracy 45.2           1249
Visual Question Answering        VQA v2 (test-dev)   Overall Accuracy 65.9   706
Multimodal Evaluation            MME                 Score 1350              658
Multimodal Understanding         MMBench             Accuracy 54.5           637
Multimodal Understanding         MM-Vet              MM-Vet Score 54.5       531
Visual Question Answering        GQA                 Accuracy 38.4           505

Showing 10 of 129 rows.
