
What matters when building vision-language models?

About

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh • 2024
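
The abstract notes that the base, instructed, and chat variants of Idefics2 are released. As a minimal sketch of how the released model could be loaded for inference, assuming it is published on the Hugging Face Hub under the id "HuggingFaceM4/idefics2-8b" and follows the standard transformers Vision2Seq API (both assumptions, not stated on this page):

```python
# Hedged sketch: loading Idefics2 for single-image VQA-style inference.
# The checkpoint id "HuggingFaceM4/idefics2-8b" and the example URL are
# assumptions for illustration, not taken from the page above.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    device_map="auto",
)

# One image plus one question, formatted with the processor's chat template.
image = load_image("https://example.com/chart.png")  # hypothetical URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does this chart show?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```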

Related benchmarks

Task                                      | Dataset           | Metric           | Result | Rank
Visual Question Answering                 | VQA v2            | Accuracy         | 81.2   | 1165
Visual Question Answering                 | TextVQA           | Accuracy         | 70.4   | 1117
Object Hallucination Evaluation           | POPE              | Accuracy         | 86.2   | 935
Visual Question Answering                 | VQA v2 (test-dev) | Overall Accuracy | 81.2   | 664
Text-based Visual Question Answering      | TextVQA           | Accuracy         | 73     | 496
Mathematical Reasoning                    | MathVista         | Score            | 52.2   | 322
Visual Question Answering                 | TextVQA (val)     | VQA Score        | 73     | 309
Multimodal Reasoning                      | MM-Vet            | MM-Vet Score     | 34     | 281
Multi-discipline Multimodal Understanding | MMMU              | Accuracy         | 43     | 266
Visual Question Answering                 | OK-VQA            | Accuracy         | 53.5   | 224
Showing 10 of 74 rows
...
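
Several rows above report "Accuracy" on VQA-style benchmarks. For context, the standard VQA accuracy metric credits an answer in proportion to how many of the ten human annotators gave it, averaged over leave-one-annotator-out subsets. A minimal sketch of that metric follows; note the official evaluation also normalizes answer strings (case, punctuation, articles), which this sketch omits:

```python
# Hedged sketch of the standard VQA accuracy metric: an answer scores
# min(#matching human answers / 3, 1), averaged over the ten subsets
# formed by leaving out one annotator. String normalization is omitted.
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    scores = []
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == prediction for a in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Example: four of ten annotators answered "2", so predicting "2" scores 1.0.
print(vqa_accuracy("2", ["2", "2", "2", "2", "two", "3", "2 dogs", "3", "two", "3"]))
```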

Other info

Code
