What matters when building vision-language models?

About

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

Hugo Lauren\c{c}on, L\'eo Tronchon, Matthieu Cord, Victor Sanh• 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy86.2	2019
Visual Question Answering	TextVQA	Accuracy70.4	1453
Visual Question Answering	VQA v2	Accuracy81.2	1429
Text-based Visual Question Answering	TextVQA	Accuracy73	962
Multimodal Understanding	MMBench	--	847
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy81.2	712
Human-Object Interaction Detection	HICO-DET (test)	mAP (full)6.59	544
Multimodal Reasoning	MM-Vet	MM-Vet Score34	517
Mathematical Reasoning	MathVista	Score52.2	474
Hallucination Evaluation	CHAIR	CHAIR_s7.6	393

Showing 10 of 118 rows

...

Other info

Code

Follow for update

@wizwand_team Discord