
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

About

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
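The paper's filtering rules are described in detail in the full text; as a rough illustrative sketch only (not the authors' actual implementation), a document-level filter over interleaved image-text documents might look like the following, where the document schema, thresholds, and function name are all hypothetical:

```python
# Illustrative sketch of document-level filtering for an interleaved
# image-text corpus. The schema (parallel "texts"/"images" lists) and
# all thresholds are hypothetical, not the actual OBELICS rules.

def keep_document(doc, min_tokens=50, min_images=1, max_images=30):
    """Keep a web document only if it has enough text and a sane image count.

    `doc` is a dict with a "texts" list (text blocks, None at image slots)
    and an "images" list (image URLs, None at text slots).
    """
    n_tokens = sum(len(t.split()) for t in doc["texts"] if t is not None)
    n_images = sum(1 for im in doc["images"] if im is not None)
    return n_tokens >= min_tokens and min_images <= n_images <= max_images

# Example: a document with one image but very little text is dropped.
doc = {
    "texts": ["A cat sitting on a mat.", None],
    "images": [None, "https://example.com/cat.jpg"],
}
print(keep_document(doc))  # False: only 6 whitespace tokens, below min_tokens
```

Real web-scale pipelines apply many such rules in sequence (deduplication, language identification, NSFW filtering, and so on); this sketch only shows the general shape of one per-document predicate.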

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 65.4 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 39.3 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 48.3 | 1043 |
| Visual Question Answering | GQA | Accuracy | 45.2 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 74.6 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 65.9 | 664 |
| Multimodal Evaluation | MME | Score | 1350 | 557 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 54.5 | 418 |
| Visual Question Answering | GQA | Accuracy | 38.4 | 374 |
| Multimodal Understanding | MMBench | Accuracy | 54.5 | 367 |
Showing 10 of 100 rows.
