Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
About
Multimodal large language models have made significant advancements in recent years, yet they still suffer from a common issue known as the "hallucination problem", in which the models generate textual descriptions that inaccurately depict or entirely fabricate content from associated images. This paper introduces a novel solution, Hallucination-Aware Direct Preference Optimization (HA-DPO), which reframes the hallucination problem as a preference selection task. The model is trained to favor the non-hallucinating response when presented with two responses to the same image (one accurate and one hallucinatory). Furthermore, this paper proposes an efficient pipeline for constructing positive (non-hallucinatory) and negative (hallucinatory) sample pairs, ensuring a high-quality, style-consistent dataset for robust preference learning. When applied to three mainstream multimodal models, HA-DPO significantly reduced hallucination issues and improved the models' generalization capabilities. Notably, the MiniGPT-4 model, when enhanced with HA-DPO, demonstrated a substantial improvement: POPE accuracy rose from 51.13% to 86.13% (an absolute improvement of 35 percentage points), and the MME score rose from 932.00 to 1326.46 (a relative improvement of 42.32%). The code, models, and datasets are available at https://opendatalab.github.io/HA-DPO.
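The preference selection task described in the abstract can be illustrated with the standard Direct Preference Optimization loss for a single response pair. This is a minimal sketch, not the authors' implementation: it assumes the generic DPO objective, sequence-level log-probabilities that have already been computed, and a hypothetical `beta` value of 0.1. The function and argument names are illustrative and do not come from the paper or its code release.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float,
             policy_logp_rejected: float,
             ref_logp_chosen: float,
             ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Generic DPO loss for one (chosen, rejected) pair.

    "chosen" stands for the non-hallucinatory response and "rejected"
    for the hallucinatory one (an assumed mapping for illustration).
    All inputs are summed token log-probabilities of a full response.
    """
    # How much more the policy favours each response than the frozen reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy matches the reference model the loss equals log 2, and it decreases as the policy assigns relatively more probability to the non-hallucinatory response than the reference does.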
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 77.6 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 56.7 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 53.9 | 1043 |
| Multimodal Evaluation | MME | -- | -- | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 58 | 496 |
| Multimodal Capability Evaluation | MM-Vet | Score | 30.9 | 282 |
| Science Question Answering | ScienceQA | Accuracy | 68.1 | 229 |
| Hallucination Evaluation | MMHal-Bench | MMHal Score | 1.98 | 174 |
| Hallucination Evaluation | CHAIR | CHAIR_s | 46.5 | 166 |
| Vision Understanding | MMBench | Accuracy | 63.9 | 104 |