
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

About

Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning the visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses, incorporating an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics by designing effective critic prompts, eliminating the need for additional fine-tuning with external instruction data. We introduce three novel visual metrics within the self-critic process to guide judgment, significantly improving self-critic accuracy. Through extensive experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA significantly improves LVLMs' performance and outperforms previous approaches, achieving superior modality alignment.
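As a rough illustration of the loop the abstract describes, here is a minimal Python sketch of one self-improvement round. This is not the authors' code: the `generate` callable, the critic-prompt wording, the greedy-vs-sampled candidate scheme, and the DPO follow-up step are assumptions made for illustration; the paper's actual critic prompt and its three visual metrics are only loosely paraphrased in the comments.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A stand-in type: any callable mapping (image_path, prompt, temperature) -> text.
# In practice this would wrap the LVLM's own generation method.
Generate = Callable[[str, str, float], str]

@dataclass
class PreferencePair:
    image: str
    prompt: str
    chosen: str    # response the self-critic judged better
    rejected: str  # response the self-critic judged worse

# Hypothetical critic prompt. The paper designs its own prompt built around
# three visual metrics; this generic wording is a placeholder.
CRITIC_PROMPT = (
    "Given the image and the question below, decide which of the two "
    "candidate answers describes the image more faithfully, considering "
    "the accuracy of the objects, attributes, and relationships it mentions. "
    "Reply with '1' or '2'.\n"
    "Question: {q}\nAnswer 1: {a}\nAnswer 2: {b}"
)

def build_preference_pairs(generate: Generate,
                           dataset: List[Tuple[str, str]]) -> List[PreferencePair]:
    """One self-improvement round: self-generate two candidate responses per
    instruction, then let the same model act as critic to rank them."""
    pairs: List[PreferencePair] = []
    for image, prompt in dataset:
        # Two decoding strategies yield two candidates (assumed greedy vs. sampled).
        greedy = generate(image, prompt, 0.0)
        sampled = generate(image, prompt, 1.0)
        if greedy == sampled:
            continue  # identical candidates carry no preference signal
        # In-context self-critic: the LVLM judges its own outputs.
        verdict = generate(image, CRITIC_PROMPT.format(q=prompt, a=greedy, b=sampled), 0.0)
        if verdict.strip().startswith("1"):
            chosen, rejected = greedy, sampled
        else:
            chosen, rejected = sampled, greedy
        pairs.append(PreferencePair(image, prompt, chosen, rejected))
    # The collected pairs would then feed a preference-optimization step
    # (e.g. DPO) that updates the model, closing the self-improvement loop.
    return pairs
```

Because both candidate responses and the judgment come from the same model, the loop needs no external critic model or extra instruction data, which is the point of the framework.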

Xiyao Wang, Jiuhai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Furong Huang, Cao Xiao • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | TextVQA | -- | -- | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 62.1 | 1043 |
| Multimodal Evaluation | MME | -- | -- | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 66.1 | 496 |
| Multimodal Understanding | MMBench | Accuracy | 71.04 | 367 |
| Multimodal Capability Evaluation | MM-Vet | Score | 38.4 | 282 |
| Multimodal Understanding | MMMU | Accuracy | 35.14 | 275 |
| Science Question Answering | ScienceQA | Accuracy | 72.5 | 229 |
| Multimodal Understanding | SEED-Bench | Accuracy | 64.68 | 203 |
| Multimodal Understanding | MMStar | Accuracy | 32.4 | 197 |

(Showing 10 of 23 rows.)
