Vision-Language Models Can Self-Improve Reasoning via Reflection

About

Chain-of-thought (CoT) reasoning has been shown to improve the reasoning capability of large language models (LLMs). However, due to the complexity of multimodal scenarios and the difficulty of collecting high-quality CoT data, CoT reasoning in multimodal LLMs has been largely overlooked. To this end, we propose a simple yet effective self-training framework, R3V, which iteratively enhances the model's Vision-language Reasoning by Reflecting on CoT Rationales. Our framework consists of two interleaved parts: (1) iteratively bootstrapping positive and negative solutions for reasoning datasets, and (2) reflecting on rationales to learn from mistakes. Specifically, we introduce the self-refine and self-select losses, which enable the model to refine flawed rationales and to derive the correct answer by comparing rationale candidates. Experiments on a wide range of vision-language tasks show that R3V consistently improves multimodal LLM reasoning, achieving a relative improvement of 23% to 60% over GPT-distilled baselines. Additionally, our approach supports self-reflection on generated solutions, further boosting performance through test-time computation.
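The two parts described in the abstract can be illustrated with a minimal data-construction sketch. This is not the authors' implementation: the names `Solution`, `bootstrap`, `build_refine_pairs`, and `build_select_example` are hypothetical, `sample_fn` stands in for the model's CoT sampler, and labeling a solution positive when its final answer exactly matches the gold answer is an assumption made here for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Solution:
    rationale: str  # the sampled chain-of-thought text
    answer: str     # the final answer extracted from it


def bootstrap(question: str, gold: str,
              sample_fn: Callable[[str], Solution],
              k: int = 4) -> Tuple[List[Solution], List[Solution]]:
    """Sample k solutions and split them into positives and negatives.

    Assumption: correctness is judged by exact match with the gold answer.
    """
    positives, negatives = [], []
    for _ in range(k):
        sol = sample_fn(question)
        (positives if sol.answer == gold else negatives).append(sol)
    return positives, negatives


def build_refine_pairs(positives: List[Solution],
                       negatives: List[Solution]
                       ) -> List[Tuple[Solution, Solution]]:
    """Self-refine data: map each flawed solution to a correct one."""
    return [(neg, pos) for neg in negatives for pos in positives]


def build_select_example(positives: List[Solution],
                         negatives: List[Solution],
                         n_candidates: int = 3,
                         seed: int = 0
                         ) -> Optional[Tuple[List[Solution], str]]:
    """Self-select data: shuffled candidates plus the correct answer as target."""
    if not positives or not negatives:
        return None  # nothing to compare against
    candidates = [positives[0]] + negatives[: n_candidates - 1]
    random.Random(seed).shuffle(candidates)
    return candidates, positives[0].answer
```

In a full training loop, the pairs and candidate sets built this way would be turned into prompts and trained with the self-refine and self-select losses, and the sampling step would be repeated with the updated model in each iteration. Those details are beyond what the abstract specifies and are left out of this sketch.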

Kanzhi Cheng, Yantao Li, Fangzhi Xu, Jianbing Zhang, Hao Zhou, Yang Liu • 2024

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MathVerse	Accuracy	12.34	266
Multimodal Reasoning	MMMU	Accuracy	14.78	220
Chart Understanding and Reasoning	ChartQA	Accuracy	73.1	143
Visual Reasoning	BLINK	Accuracy	44.73	116
Multimodal Medical Reasoning	VQA-RAD	Accuracy (%)	72.51	48
Multimodal Reasoning	ScienceQA	Average Accuracy	86.83	45
Multimodal Reasoning	Medical and Mathematical Multimodal Reasoning SLAKE, VQA-Rad, Geo3K	Overall Performance	66.76	36
Multimodal Reasoning	Slake	Accuracy	87.04	30
Multimodal Reasoning	Geo3K	Accuracy	43.76	21
Visual Question Answering	M3CoT (test)	Language Science Score	76.78	15

Showing 10 of 11 rows
