Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks

About

While Vision-Language Models (VLMs) have shown remarkable abilities in visual and language reasoning tasks, they invariably generate flawed responses. Self-correction that instructs models to refine their outputs presents a promising solution to this issue. Previous studies have mainly concentrated on Large Language Models (LLMs), while the self-correction abilities of VLMs, particularly concerning both visual and linguistic information, remain largely unexamined. This study investigates the self-correction capabilities of VLMs during both inference and fine-tuning stages. We introduce a Self-Correction Learning (SCL) approach that enables VLMs to learn from their self-generated self-correction data through Direct Preference Optimization (DPO) without relying on external feedback, facilitating self-improvement. Specifically, we collect preferred and disfavored samples based on the correctness of initial and refined responses, which are obtained by two-turn self-correction with VLMs during the inference stage. Experimental results demonstrate that although VLMs struggle to self-correct effectively during iterative inference without additional fine-tuning and external feedback, they can enhance their performance and avoid previous mistakes through preference fine-tuning when their self-generated self-correction data are categorized into preferred and disfavored samples. This study emphasizes that self-correction is not merely a refinement process; rather, it should enhance the reasoning abilities of models through additional training, enabling them to generate high-quality responses directly without further refinement.
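The data-collection step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The record layout (`SelfCorrectionSample`, `PreferencePair`), the rule that the correct response of a pair is preferred over the incorrect one, and the choice to skip traces where both responses are correct or both are wrong are all assumptions made here. `dpo_loss` is the standard per-pair DPO objective computed from sequence log-probabilities, with `beta=0.1` as an arbitrary placeholder value.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SelfCorrectionSample:
    """One two-turn self-correction trace (hypothetical record layout)."""
    prompt: str
    initial_response: str
    refined_response: str
    initial_correct: bool
    refined_correct: bool


@dataclass
class PreferencePair:
    """A (prompt, chosen, rejected) triple in the shape DPO training expects."""
    prompt: str
    chosen: str
    rejected: str


def to_preference_pair(s: SelfCorrectionSample) -> Optional[PreferencePair]:
    # Assumption: the correct response is preferred over the incorrect one.
    if s.refined_correct and not s.initial_correct:
        return PreferencePair(s.prompt, s.refined_response, s.initial_response)
    if s.initial_correct and not s.refined_correct:
        return PreferencePair(s.prompt, s.initial_response, s.refined_response)
    # Assumption: traces with no correctness contrast carry no preference signal.
    return None


def build_pairs(samples: List[SelfCorrectionSample]) -> List[PreferencePair]:
    pairs = (to_preference_pair(s) for s in samples)
    return [p for p in pairs if p is not None]


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


samples = [
    SelfCorrectionSample("q1", "A", "B", False, True),   # refinement fixed the answer
    SelfCorrectionSample("q2", "C", "D", True, False),   # refinement broke the answer
    SelfCorrectionSample("q3", "E", "E", True, True),    # no contrast, skipped
    SelfCorrectionSample("q4", "F", "G", False, False),  # no contrast, skipped
]
pairs = build_pairs(samples)
```

In this sketch, `pairs` holds two entries: one where the refined response is chosen and one where the initial response is chosen. The loss decreases as the policy assigns relatively more probability to the chosen response than the reference model does.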

Jiayi He, Hehai Lin, Qingyun Wang, Yi Fung, Heng Ji • 2024

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMBench	Accuracy 81.4	847
Science Question Answering	ScienceQA	Accuracy 79.9	791
Multimodal Understanding	SEED-Bench	Accuracy 70.6	516
Multimodal Understanding	MMMU	Accuracy 58.2	437
Multimodal Understanding	MMStar	Accuracy 53	407
Diagram Question Answering	AI2D	AI2D Accuracy 81.11	387
Multimodal Understanding	MMT-Bench	Accuracy 50.4	25
Visual Question Answering	RealWorldQA 1.0 (test)	Accuracy 0.6353	15
