MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

About

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 88.6	2056
Text-based Visual Question Answering	TextVQA	Accuracy 72.5	984
Science Question Answering	ScienceQA	Accuracy 82.1	916
Multimodal Evaluation	MME	--	902
Visual Question Answering	ChartQA	Accuracy 67.2	620
Multimodal Understanding	SEED-Bench	Accuracy 70.2	571
Diagram Question Answering	AI2D	AI2D Accuracy 59.3	509
Multimodal Capability Evaluation	MM-Vet	Score 41	429
Visual Question Answering	TextVQA (val)	VQA Score 79.2	371
Referring Expression Comprehension	RefCOCO (testA)	Accuracy 0.925	351

Showing 10 of 84 rows
