Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
About
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still suffer from a fundamental limitation: hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that do and do not contain hallucinations are entangled, making them challenging to distinguish. These two observations inspire a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use hallucinated text as hard negative examples, naturally bringing the representations of non-hallucinated text and visual samples closer while pushing apart the representations of non-hallucinated and hallucinated text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMHal-Bench benchmark, our method obtains a 34.66%/29.5% improvement over the baselines MiniGPT-4/LLaVA. Our code is available at https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
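The core idea above, contrastive alignment with hallucinated text as extra hard negatives, can be sketched as an InfoNCE-style loss. The snippet below is a minimal NumPy illustration, not the paper's implementation: the function name, shapes, and temperature value are assumptions, and in-batch texts serve as the ordinary negatives while one hallucinated caption per image serves as the hard negative.

```python
import numpy as np

def contrastive_loss_with_hard_negatives(visual, pos_text, halluc_text, temperature=0.07):
    """InfoNCE-style loss (illustrative sketch, not the paper's code).

    visual:      (B, D) visual embeddings
    pos_text:    (B, D) matching, non-hallucinated text embeddings
    halluc_text: (B, D) hallucinated text embeddings used as hard negatives
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    v = normalize(visual)
    t = normalize(pos_text)
    h = normalize(halluc_text)

    # Cosine similarity of each image to every text in the batch:
    # the diagonal holds the positive pairs, off-diagonals are in-batch negatives.
    sim_batch = (v @ t.T) / temperature                              # (B, B)
    # Similarity of each image to its own hallucinated caption (hard negative).
    sim_hard = np.sum(v * h, axis=-1, keepdims=True) / temperature   # (B, 1)

    logits = np.concatenate([sim_batch, sim_hard], axis=1)           # (B, B+1)
    logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the diagonal (matching text) as the target class:
    # minimizing this pulls image/text pairs together and pushes the
    # hallucinated captions (and other texts) away.
    idx = np.arange(v.shape[0])
    return -np.mean(log_probs[idx, idx])
```

As a sanity check, embeddings that are nearly aligned with their matching text should yield a lower loss than randomly paired ones, since the positive logit dominates the softmax.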
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 79.1 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 59.8 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 50.5 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.5 | 963 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 30.4 | 418 |
| Object Hallucination | POPE (Random) | F1 Score | 88.7 | 200 |
| Object Hallucination | POPE (Adversarial) | Accuracy | 86.54 | 196 |
| Object Hallucination | POPE (Popular) | F1 Score | 87.36 | 188 |
| Hallucination Evaluation | MMHal-Bench | MMHal Score | 2.13 | 174 |
| Multimodal Perception and Cognition | MME | Overall Score | 1530 | 103 |