
Multimodal Chain-of-Thought Reasoning in Language Models

About

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.
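The two-stage separation described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The names `TwoStagePipeline`, `toy_rationale_model`, and `toy_answer_model` are hypothetical, the "models" are toy rule-based stand-ins, and the sketch assumes that the generated rationale is appended to the text input before answer inference.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical type: a "model" maps (text input, image features) to output text.
Model = Callable[[str, Sequence[float]], str]


@dataclass
class TwoStagePipeline:
    """Illustrative two-stage flow: rationale generation, then answer inference."""
    rationale_model: Model  # stage 1
    answer_model: Model     # stage 2

    def run(self, question: str, image_features: Sequence[float]) -> Tuple[str, str]:
        # Stage 1: generate a rationale from the language and vision inputs.
        rationale = self.rationale_model(question, image_features)
        # Stage 2 (assumption): append the rationale to the text input and
        # infer the answer, reusing the same vision input.
        augmented = f"{question}\nRationale: {rationale}"
        answer = self.answer_model(augmented, image_features)
        return rationale, answer


def toy_rationale_model(text: str, feats: Sequence[float]) -> str:
    # Toy stand-in: describes the image by its mean feature value.
    brightness = sum(feats) / len(feats)
    return "the image is bright" if brightness > 0.5 else "the image is dark"


def toy_answer_model(text: str, feats: Sequence[float]) -> str:
    # Toy stand-in: reads the answer off the rationale contained in the text.
    return "day" if "bright" in text else "night"


pipeline = TwoStagePipeline(toy_rationale_model, toy_answer_model)
rationale, answer = pipeline.run("Is it day or night?", [0.9, 0.8])
print(rationale, "->", answer)
```

Keeping the two stages as separate callables mirrors the idea that answer inference consumes a rationale that was already grounded in both modalities, and it lets either stage be swapped or evaluated on its own.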

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Science Question Answering | ScienceQA | Accuracy: 86.5 | 791 |
| Visual Question Answering | ScienceQA | Accuracy: 74.5 | 446 |
| Mathematical Reasoning | MathVista | Accuracy: 56.4 | 382 |
| Science Question Answering | ScienceQA (test) | Average Accuracy: 91.68 | 273 |
| Visual Question Answering | A-OKVQA | Acc: 50.57 | 228 |
| Multimodal Reasoning | MMMU | Accuracy: 28.7 | 208 |
| Mathematical Reasoning | MathVision | Accuracy: 21.8 | 168 |
| Visual Perception | MMVP | Accuracy: 68.1 | 118 |
| Visual Question Answering | ScienceQA (test) | Accuracy: 84.91 | 115 |
| Multimodal Model Evaluation | MME | Score: 1.87e+3 | 102 |
Showing 10 of 57 rows
