Multimodal Chain-of-Thought Reasoning in Language Models

About

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.
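The two-stage framework described above can be sketched in code. The stub below is a minimal illustration, not the authors' implementation: `generate` stands in for any vision-language seq2seq model (the paper fine-tunes T5-style models with fused vision features), and the function names and prompt formats are hypothetical. The point is the control flow: stage 1 produces a rationale from the multimodal input, and stage 2 appends that rationale to the language input before inferring the answer.

```python
def generate(text_input: str, vision_features, target: str) -> str:
    """Stub for a multimodal seq2seq model call.

    A real implementation would encode `text_input` together with
    `vision_features` and decode the requested target sequence.
    """
    return f"<generated {target} for: {text_input.splitlines()[0]}>"


def multimodal_cot(question: str, context: str, options, vision_features):
    """Two-stage Multimodal-CoT: rationale generation, then answer inference."""
    # Stage 1: generate a rationale from the language + vision inputs.
    stage1_input = (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Options: {', '.join(options)}"
    )
    rationale = generate(stage1_input, vision_features, target="rationale")

    # Stage 2: append the generated rationale to the original language
    # input (vision features are reused) and infer the final answer.
    stage2_input = f"{stage1_input}\nRationale: {rationale}"
    answer = generate(stage2_input, vision_features, target="answer")
    return rationale, answer
```

Separating the stages lets answer inference condition on a rationale that was itself grounded in the image, which the paper credits for reduced hallucination relative to one-stage CoT.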

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Science Question Answering | ScienceQA | Accuracy | 73.8 | 502 |
| Visual Question Answering | ScienceQA | Accuracy | 74.5 | 370 |
| Mathematical Reasoning | MathVista | Accuracy | 56.4 | 257 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 91.68 | 245 |
| Visual Question Answering | A-OKVQA | Accuracy | 50.57 | 202 |
| Mathematical Reasoning | MathVision | Accuracy | 21.8 | 144 |
| Multimodal Reasoning | MMMU | Accuracy | 28.7 | 130 |
| Visual Question Answering | ScienceQA (test) | Accuracy | 84.91 | 113 |
| Multimodal Model Evaluation | MME | Score | 1.87e+3 | 98 |
| Visual Perception | MMVP | Accuracy | 68.1 | 82 |

Showing 10 of 31 rows.
