
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

About

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models of up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
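The abstract highlights that mixing image-caption, interleaved image-text, and text-only data is central to the pre-training recipe. As a minimal illustrative sketch (not the authors' code), the following shows one common way such a mix can be realized: sampling each training example from one of the three data streams according to fixed mixing weights. The weights, the `next_example()` stream interface, and the function names here are hypothetical placeholders, not the paper's actual ratios or implementation.

```python
import random

# Hypothetical mixing weights for the three pre-training data types
# discussed in the paper (placeholder values, not the paper's ratios).
MIX_WEIGHTS = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def sample_batch(streams, batch_size, weights=MIX_WEIGHTS, seed=None):
    """Draw a batch whose composition follows the given mixing weights.

    `streams` maps each data type to a source object exposing a
    `next_example()` method (an assumed interface for this sketch).
    Returns a list of (data_type, example) pairs.
    """
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        # Pick a data type proportionally to its mixing weight,
        # then pull the next example from that stream.
        kind = rng.choices(names, weights=probs, k=1)[0]
        batch.append((kind, streams[kind].next_example()))
    return batch
```

In expectation, each batch then contains the three data types in the chosen proportions, which is the property the ablations in the paper vary.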

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VQA v2 | Accuracy | 82.8 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 72.8 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 57.9 | 1043 |
| Object Hallucination Evaluation | POPE | Accuracy | 87.8 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 83.7 | 664 |
| Multimodal Evaluation | MME | Score | 1.99e+3 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 72.8 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 72.1 | 418 |
| Multimodal Understanding | MMBench | Accuracy | 72.7 | 367 |
| Mathematical Reasoning | MathVista | Score | 40.9 | 322 |

Showing 10 of 90 rows.
