Enhancing Large Vision Language Models with Self-Training on Image Comprehension
About
Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the model's perception capability to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings at alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training that addresses the unique visual perception and reasoning capabilities of LVLMs remains a challenge. To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach that specifically targets image comprehension. First, the model self-constructs a preference dataset of image descriptions from unlabeled images: preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.
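The first stage described above can be sketched in code. The following is a minimal, hedged illustration of how a DPO-style preference record might be built from a single unlabeled image; the prompt strings, the `corrupt` helper, and the `generate` callable are all hypothetical placeholders, not the authors' actual implementation.

```python
import random

# Illustrative prompts (assumptions, not the paper's exact wording).
STEP_BY_STEP_PROMPT = (
    "Describe the image in detail, reasoning step by step about the "
    "objects, their attributes, and their relationships."
)
MISLEADING_PROMPTS = [
    "Describe the image, but mention at least one object that is not present.",
    "Give a description that subtly contradicts the image content.",
]

def corrupt(image):
    """Placeholder image corruption (in practice e.g. low-resolution
    downsampling or color jitter); here it only marks the image."""
    return {"data": image["data"], "corrupted": True}

def build_preference_pair(image, generate, rng=random):
    """Build one preference record from an unlabeled image.

    `generate(image, prompt)` stands in for any LVLM inference call.
    Preferred response: a step-by-step description of the clean image.
    Dis-preferred response: either a description of a corrupted image
    or a description elicited by a misleading prompt (coin flip).
    """
    chosen = generate(image, STEP_BY_STEP_PROMPT)
    if rng.random() < 0.5:
        rejected = generate(corrupt(image), STEP_BY_STEP_PROMPT)
    else:
        rejected = generate(image, rng.choice(MISLEADING_PROMPTS))
    return {"prompt": STEP_BY_STEP_PROMPT, "chosen": chosen, "rejected": rejected}
```

Records of this shape could then feed a standard preference-optimization objective such as DPO, with the second stage appending the model's own descriptions to reused instruction-tuning prompts.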
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 65.2 | 1117 |
| Multimodal Capability Evaluation | MM-Vet | Score | 45 | 282 |
| Visual Question Answering | ChartQA | Accuracy | 41.5 | 239 |
| Multimodal Model Evaluation | MMBench | Accuracy | 67.8 | 180 |
| Mathematical Reasoning | MathVista | Accuracy | 37 | 97 |
| Multimodal Evaluation | LLaVA-Bench | LLaVA-Bench Score | 79.2 | 38 |
| Scientific Question Answering | ScienceQA | Accuracy | 75.3 | 7 |