xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
About
This paper introduces xGen-MM (BLIP-3), an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base models and the instruction fine-tuned ones. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks, and demonstrate competitive performance among open-source LMMs of similar size, with the ability to comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.
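Since the models are released as open checkpoints, a minimal inference sketch may help illustrate how an instruction-tuned LMM of this kind is typically used. The snippet below assumes the checkpoints follow the standard Hugging Face `transformers` loading convention with custom model code; the model ID, the `<image>` placeholder token, and the prompt format are assumptions for illustration, not confirmed by this page.

```python
# Hedged sketch: loading an instruction-tuned xGen-MM (BLIP-3) checkpoint
# via the generic transformers Auto* classes. The model ID and prompt
# template below are assumed; consult the released model card for the
# exact identifiers and preprocessing.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"  # assumed checkpoint name

# Custom architectures are commonly shipped with trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).eval()

# Single-image VQA-style query; interleaved image-text inputs would
# interleave several <image> placeholders with text segments.
image = Image.open("example.jpg")
prompt = "<image> Describe this image."  # assumed placeholder convention
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```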
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87 | 935 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 81.5 | 337 |
| Visual Question Answering | TextVQA (val) | VQA Score | 71 | 309 |
| Multi-discipline Multimodal Understanding | MMMU | -- | -- | 266 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 88.3 | 208 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 41.1 | 167 |
| Multimodal Understanding | SEED | Accuracy | 72.2 | 136 |
| Vision Understanding | MMBench | -- | -- | 104 |
| Visual Question Answering | ChartQA (test) | Accuracy | 60 | 58 |
| Visual Question Answering | AI2D (test) | Accuracy | 74.2 | 54 |