Building and better understanding vision-language models: insights and future directions

About

The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.

Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon • 2024

Related benchmarks

Task                                      | Dataset        | Result               | Rank
Object Hallucination Evaluation           | POPE           | --                   | 1455
Multimodal Evaluation                     | MME            | --                   | 658
Science Question Answering                | ScienceQA      | --                   | 502
OCR Evaluation                            | OCRBench       | Score: 55            | 329
Multi-discipline Multimodal Understanding | MMMU           | --                   | 317
Visual Mathematical Reasoning             | MathVista      | Accuracy: 58.4       | 278
Diagram Understanding                     | AI2D           | Accuracy: 76.5       | 247
Multi-discipline Multimodal Understanding | MMMU (val)     | --                   | 204
Visual Understanding                      | MM-Vet         | MM-Vet Score: 41.7   | 142
Hallucination Evaluation                  | HallusionBench | Average Score: 43.7  | 108

Showing 10 of 36 rows
