Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

About

Large vision-language models (LVLMs) have achieved impressive results on a range of vision-language tasks. Despite this performance, LVLMs suffer from hallucinations caused by language bias, which reduces their focus on images and leads to ineffective visual comprehension. We identify two primary causes of this bias: (1) the difference in training-data scale between the LLM pretraining stage and the multimodal alignment stage, and (2) a learned inference bias arising from the short-term dependency of text data. We therefore propose LACING, a systemic framework designed to address the language bias of LVLMs with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt that replaces visual inputs during training and inference, designed to compel LVLMs to prioritize text inputs. IFG further proposes a novel decoding strategy that uses the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang • 2024
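
The abstract does not state the formula behind the IFG decoding strategy, so the following is only an illustrative sketch of one common way such guidance can be built: contrasting next-token scores computed with the real image against scores computed with the image replaced by the soft visual prompt. The combination rule (a classifier-free-guidance-style contrast), the function names, the `alpha` weight, and the toy numbers are all assumptions made for illustration and are not taken from the paper or its code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def guided_scores(image_logits, soft_prompt_logits, alpha=1.0):
    """Hypothetical guidance rule (an assumption, not the paper's formula).

    image_logits:       next-token logits when the model sees the real image.
    soft_prompt_logits: next-token logits when the image is replaced by the
                        soft visual prompt, i.e. a text-dominated prediction.
    alpha:              guidance strength; alpha=0 recovers ordinary decoding.
    """
    if len(image_logits) != len(soft_prompt_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    lp_img = log_softmax(image_logits)
    lp_soft = log_softmax(soft_prompt_logits)
    # Boost tokens supported by the image, penalize tokens that the
    # text-only context already favors.
    return [(1 + alpha) * a - alpha * b for a, b in zip(lp_img, lp_soft)]


def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy three-token vocabulary with made-up logits.
vocab = ["dog", "cat", "frisbee"]
image_logits = [1.9, 2.0, 0.5]        # image evidence is nearly tied
soft_prompt_logits = [0.5, 2.5, 0.2]  # text context strongly prefers "cat"

plain = greedy_pick(guided_scores(image_logits, soft_prompt_logits, alpha=0.0))
guided = greedy_pick(guided_scores(image_logits, soft_prompt_logits, alpha=1.0))
print(vocab[plain], vocab[guided])
```

In this toy case ordinary decoding follows the text prior, while the guided scores favor the token whose support comes from the image rather than from the surrounding text. A real implementation would obtain both logit vectors from two forward passes of the model at every decoding step.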

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Multimodal Understanding	SEED-Bench	--	571
Hallucination Evaluation	CHAIR	CHAIR_s 30.85	393
Hallucination Evaluation	MMHal-Bench	MMHal Score 2.65	309
Hallucination Evaluation	POPE	Accuracy 86.68	281
Science Question Answering	ScienceQA (test)	Average Accuracy 71.26	273
Document Visual Question Answering	DocVQA	Accuracy 21.45	203
Visual Understanding	MM-Vet	MM-Vet Score 39.9	190
Vision Understanding	MMBench	Accuracy 66.45	141
Visual Question Answering	MMVP	Accuracy 32	82

Showing 10 of 17 rows
