Multimodal Chain-of-Thought Reasoning in Language Models

About

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola • 2023
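
The two-stage framework described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Example` fields, the prompt layout in `build_input`, and the toy stand-in models are all hypothetical, and passing the image features to the second stage as well is an assumption. The sketch only shows the control flow: stage one generates a rationale from text and image input, and stage two infers the answer from the same input with the rationale appended.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a model maps (text, image features) to generated text.
Model = Callable[[str, Sequence[float]], str]


@dataclass
class Example:
    question: str
    context: str
    options: Sequence[str]
    image_features: Sequence[float]  # stand-in for real vision features


def build_input(ex: Example) -> str:
    """Format the language input (illustrative layout, not the paper's)."""
    opts = " ".join(f"({chr(ord('A') + i)}) {o}" for i, o in enumerate(ex.options))
    return f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"


def multimodal_cot(ex: Example, rationale_model: Model, answer_model: Model) -> Tuple[str, str]:
    """Run rationale generation, then answer inference on the augmented input."""
    text = build_input(ex)
    # Stage 1: generate a rationale from text and image input.
    rationale = rationale_model(text, ex.image_features)
    # Stage 2: append the rationale to the text and infer the answer.
    answer = answer_model(f"{text}\nRationale: {rationale}", ex.image_features)
    return rationale, answer


# Toy stand-ins so the sketch runs without any trained model.
def toy_rationale_model(text: str, image_features: Sequence[float]) -> str:
    return "The image shows opposite poles facing each other, so the magnets attract."


def toy_answer_model(text: str, image_features: Sequence[float]) -> str:
    # Only answers (A) when the rationale from stage 1 is present in the input.
    return "The answer is (A)." if "opposite poles" in text else "The answer is (B)."


if __name__ == "__main__":
    ex = Example(
        question="Will these magnets attract or repel?",
        context="Two magnets are placed side by side.",
        options=["attract", "repel"],
        image_features=[0.1, 0.2],
    )
    rationale, answer = multimodal_cot(ex, toy_rationale_model, toy_answer_model)
    print(rationale)
    print(answer)
```

Keeping the two stages as separate callables mirrors the separation of rationale generation and answer inference, so either stage could be swapped or evaluated on its own.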

Related benchmarks

Task	Dataset	Result
Science Question Answering	ScienceQA	Accuracy 86.5	791
Visual Question Answering	ScienceQA	Accuracy 74.5	446
Mathematical Reasoning	MathVista	Accuracy 56.4	382
Science Question Answering	ScienceQA (test)	Average Accuracy 91.68	273
Visual Question Answering	A-OKVQA	Acc 50.57	228
Multimodal Reasoning	MMMU	Accuracy 28.7	208
Mathematical Reasoning	MathVision	Accuracy 21.8	168
Visual Perception	MMVP	Accuracy 68.1	118
Visual Question Answering	ScienceQA (test)	Accuracy 84.91	115
Multimodal Model Evaluation	MME	Score 1.87e+3	102

Showing 10 of 57 rows
