Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
About
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still suffer from a fundamental limitation: hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that do and do not contain hallucinations are entangled, making them challenging to distinguish. These two observations inspire a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use hallucinated text as hard negative examples, naturally bringing the representations of non-hallucinated text and visual samples closer while pushing apart the representations of non-hallucinated and hallucinated text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMHal-Bench benchmark, our method obtains a 34.66%/29.5% improvement over the baselines MiniGPT-4/LLaVA. Our code is available at https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
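The core idea above, contrastive alignment with hallucinated text as extra hard negatives, can be sketched as an InfoNCE-style loss. The snippet below is a minimal NumPy illustration, not the paper's implementation: the function name, shapes, and temperature value are assumptions, and in-batch texts serve as the ordinary negatives while one hallucinated caption per image serves as the hard negative.

```python
import numpy as np

def contrastive_loss_with_hard_negatives(visual, pos_text, halluc_text, temperature=0.07):
    """InfoNCE-style loss (illustrative sketch, not the paper's code).

    visual:      (B, D) visual embeddings
    pos_text:    (B, D) matching, non-hallucinated text embeddings
    halluc_text: (B, D) hallucinated text embeddings used as hard negatives
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    v = normalize(visual)
    t = normalize(pos_text)
    h = normalize(halluc_text)

    # Cosine similarity of each image to every text in the batch:
    # the diagonal holds the positive pairs, off-diagonals are in-batch negatives.
    sim_batch = (v @ t.T) / temperature                              # (B, B)
    # Similarity of each image to its own hallucinated caption (hard negative).
    sim_hard = np.sum(v * h, axis=-1, keepdims=True) / temperature   # (B, 1)

    logits = np.concatenate([sim_batch, sim_hard], axis=1)           # (B, B+1)
    logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the diagonal (matching text) as the target class:
    # minimizing this pulls image/text pairs together and pushes the
    # hallucinated captions (and other texts) away.
    idx = np.arange(v.shape[0])
    return -np.mean(log_probs[idx, idx])
```

As a sanity check, embeddings that are nearly aligned with their matching text should yield a lower loss than randomly paired ones, since the positive logit dominates the softmax.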
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 79.1 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 59.8 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 50.5 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.5 | 963 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 30.4 | 418 |
| Object Hallucination | POPE (Random) | F1 Score | 88.7 | 200 |
| Object Hallucination | POPE (Adversarial) | Accuracy | 86.54 | 196 |
| Object Hallucination | POPE (Popular) | F1 Score | 87.36 | 188 |
| Hallucination Evaluation | MMHal-Bench | MMHal Score | 2.13 | 174 |
| Multimodal Perception and Cognition | MME | Overall Score | 1530 | 103 |