MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

About

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue• 2024

Related benchmarks

Task	Dataset	Result
Visual Question Answering	TextVQA	--	1453
Mathematical Reasoning	MathVista	Score67.6	474
Optical Character Recognition	OCRBench	--	433
Diagram Question Answering	AI2D	--	387
Document Visual Question Answering	DocVQA	ANLS93.8	301
Video Understanding	VideoMME	--	222
Video Understanding	MVBench (test)	Accuracy59.1	190
Mathematical Reasoning	MathVerse	--	183
Diagram Understanding	AI2D (test)	Accuracy84	154
Chart Understanding	ChartQA (test)	Accuracy86.2	113

Showing 10 of 53 rows

Other info

Code

Follow for update

@wizwand_team Discord