
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

About

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
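The paper's filtering rules are described in detail in the full text; as a rough illustrative sketch only (not the authors' actual implementation), a document-level filter over interleaved image-text documents might look like the following, where the document schema, thresholds, and function name are all hypothetical:

```python
# Illustrative sketch of document-level filtering for an interleaved
# image-text corpus. The schema (parallel "texts"/"images" lists) and
# all thresholds are hypothetical, not the actual OBELICS rules.

def keep_document(doc, min_tokens=50, min_images=1, max_images=30):
    """Keep a web document only if it has enough text and a sane image count.

    `doc` is a dict with a "texts" list (text blocks, None at image slots)
    and an "images" list (image URLs, None at text slots).
    """
    n_tokens = sum(len(t.split()) for t in doc["texts"] if t is not None)
    n_images = sum(1 for im in doc["images"] if im is not None)
    return n_tokens >= min_tokens and min_images <= n_images <= max_images

# Example: a document with one image but very little text is dropped.
doc = {
    "texts": ["A cat sitting on a mat.", None],
    "images": [None, "https://example.com/cat.jpg"],
}
print(keep_document(doc))  # False: only 6 whitespace tokens, below min_tokens
```

Real web-scale pipelines apply many such rules in sequence (deduplication, language identification, NSFW filtering, and so on); this sketch only shows the general shape of one per-document predicate.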

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 65.4 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 39.3 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 48.3 | 1043 |
| Visual Question Answering | GQA | Accuracy | 45.2 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 74.6 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 65.9 | 664 |
| Multimodal Evaluation | MME | Score | 1350 | 557 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 54.5 | 418 |
| Visual Question Answering | GQA | Accuracy | 38.4 | 374 |
| Multimodal Understanding | MMBench | Accuracy | 54.5 | 367 |
Showing 10 of 100 rows.
