
Multimodal Chain-of-Thought Reasoning in Language Models

About

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.
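As a rough illustration of the two-stage separation described above, the sketch below shows rationale generation feeding answer inference. The `generate_rationale` and `infer_answer` functions are hypothetical stand-ins for the paper's vision-language models (see the linked repository for the actual implementation), used here only to make the pipeline structure concrete:

```python
def generate_rationale(question: str, image_features: list) -> str:
    """Stage 1: fuse the text and vision modalities to produce an
    intermediate reasoning chain (rationale). Placeholder logic only."""
    return f"rationale(question={question!r}, vision_dims={len(image_features)})"


def infer_answer(question: str, image_features: list, rationale: str) -> str:
    """Stage 2: infer the answer conditioned on the original multimodal
    inputs plus the generated rationale. Placeholder logic only."""
    return f"answer based on {rationale}"


def multimodal_cot(question: str, image_features: list) -> str:
    # The two stages are run separately, so answer inference can
    # leverage a rationale grounded in both modalities.
    rationale = generate_rationale(question, image_features)
    return infer_answer(question, image_features, rationale)


print(multimodal_cot("Which force moves the sled?", [0.1, 0.2, 0.3]))
```

The key design point this mirrors is that the rationale is generated first, from both modalities, and then passed as an additional input to the answer stage rather than being produced jointly with the answer.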

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 91.68 | 208 |
| Visual Question Answering | A-OKVQA | Acc | 50.57 | 175 |
| Visual Question Answering | ScienceQA (test) | Accuracy | 84.91 | 95 |
| Multi-choice Visual Question Answering | A-OKVQA | Accuracy | 50.6 | 49 |
| Visual Question Answering | ScienceQA Image (test) | Accuracy | 82.9 | 45 |
| Multimodal Reasoning | MMMU | Accuracy | 28.7 | 44 |
| Multimodal Reasoning | SEED-Bench Image | Score | 34.4 | 32 |
| Multimodal Science Question Answering | ScienceQA v1.0 (test) | Accuracy (Natural Language Component) | 95.91 | 31 |
| Multimodal Reasoning | M3CoT (test) | Total Acc | 48.73 | 31 |
| Visual Question Answering | MemeVQA 1.0 (test) | Accuracy | 69 | 15 |

Showing 10 of 12 rows.
