Vision-Language Models Can Self-Improve Reasoning via Reflection

About

Chain-of-thought (CoT) has proven to improve the reasoning capability of large language models (LLMs). However, due to the complexity of multimodal scenarios and the difficulty in collecting high-quality CoT data, CoT reasoning in multimodal LLMs has been largely overlooked. To this end, we propose a simple yet effective self-training framework, R3V, which iteratively enhances the model's Vision-language Reasoning by Reflecting on CoT Rationales. Our framework consists of two interleaved parts: (1) iteratively bootstrapping positive and negative solutions for reasoning datasets, and (2) reflection on rationale for learning from mistakes. Specifically, we introduce the self-refine and self-select losses, enabling the model to refine flawed rationale and derive the correct answer by comparing rationale candidates. Experiments on a wide range of vision-language tasks show that R3V consistently improves multimodal LLM reasoning, achieving a relative improvement of 23 to 60 percent over GPT-distilled baselines. Additionally, our approach supports self-reflection on generated solutions, further boosting performance through test-time computation.
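The two parts described above can be illustrated with a minimal sketch of the data-construction step. It is not the authors' implementation: it assumes that sampled solutions are labeled positive or negative by matching their final answer against the gold answer, and that the self-refine and self-select objectives are trained on (prompt, target) text pairs. The names `Sample`, `partition_samples`, `build_refine_examples`, and `build_select_example`, as well as the prompt wording, are hypothetical, and image inputs are omitted for brevity.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """One sampled chain-of-thought solution (hypothetical container)."""
    rationale: str
    answer: str


def partition_samples(samples: List[Sample], gold_answer: str) -> Tuple[List[Sample], List[Sample]]:
    """Split sampled solutions into positives and negatives.

    Assumption: a solution is positive when its final answer matches the
    gold answer after trimming whitespace and lowercasing.
    """
    gold = gold_answer.strip().lower()
    positives, negatives = [], []
    for sample in samples:
        if sample.answer.strip().lower() == gold:
            positives.append(sample)
        else:
            negatives.append(sample)
    return positives, negatives


def build_refine_examples(question: str, positives: List[Sample],
                          negatives: List[Sample]) -> List[Tuple[str, str]]:
    """Pair each flawed rationale with a correct one (self-refine style)."""
    examples = []
    if not positives:
        return examples
    for i, neg in enumerate(negatives):
        pos = positives[i % len(positives)]  # cycle through the positives
        prompt = (f"Question: {question}\n"
                  f"Flawed rationale: {neg.rationale}\n"
                  "Refine the rationale.")
        examples.append((prompt, pos.rationale))
    return examples


def build_select_example(question: str, positives: List[Sample], negatives: List[Sample],
                         gold_answer: str, rng: random.Random) -> Optional[Tuple[str, str]]:
    """Build one self-select style example from mixed candidates."""
    if not positives or not negatives:
        return None  # nothing to compare against
    candidates = [positives[0], negatives[0]]
    rng.shuffle(candidates)  # avoid a fixed position for the correct one
    listing = "\n".join(f"({i + 1}) {c.rationale}" for i, c in enumerate(candidates))
    prompt = (f"Question: {question}\nCandidate rationales:\n{listing}\n"
              "Compare the candidates and give the correct answer.")
    return prompt, gold_answer
```

In an iterative setup, the model fine-tuned on these pairs would be used to sample the next round of solutions. At test time, the same select-style prompt could be applied to the model's own sampled candidates, which is one way to read the abstract's remark on test-time computation.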

Kanzhi Cheng, Yantao Li, Fangzhi Xu, Jianbing Zhang, Hao Zhou, Yang Liu • 2024

Related benchmarks

Task | Dataset | Result | Rank
Multimodal Reasoning | MMMU | Accuracy 14.78 | 208
Mathematical Reasoning | MathVerse | Accuracy 12.34 | 183
Visual Reasoning | BLINK | Accuracy 44.73 | 107
Chart Understanding and Reasoning | ChartQA | Accuracy 73.1 | 87
Multimodal Reasoning | ScienceQA | Average Accuracy 86.83 | 45
Multimodal Reasoning | Medical and Mathematical Multimodal Reasoning (SLAKE, VQA-Rad, Geo3K) | Overall Performance 66.76 | 36
Multimodal Medical Reasoning | VQA-RAD | Accuracy (%) 72.51 | 36
Multimodal Reasoning | Slake | Accuracy 87.04 | 18
Multimodal Reasoning | Geo3K | Accuracy 43.76 | 18