Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective

About

Large Multimodal Models (LMMs) often suffer from multimodal hallucinations, in which they generate content that is not present in the visual inputs. In this paper, we explore a new angle on this issue: overly detailed training data hinders the model's ability to terminate generation in a timely manner, leading to continued outputs beyond visual perception limits. By investigating how the model decides to terminate generation with EOS, the special end-of-sentence token, we find that the model assesses the completeness of the entire sequence by comparing the generated text with the image. This observation suggests that the model has an inherent potential to make proper EOS decisions based on its visual perception and thus avoid overly lengthy outputs. To take advantage of this potential, we explore two methods to mitigate multimodal hallucinations: a training objective that enables the model to reduce hallucinations by learning from regular instruction data, and a data filtering strategy that prevents harmful training data from exacerbating model hallucinations. Both methods significantly reduce hallucination in LMMs without requiring any additional data or knowledge.
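
As a rough illustration of what inspecting the EOS decision can look like, the sketch below computes the softmax probability assigned to the EOS token at each decoding step from per-step logits. The function name `eos_probabilities`, the toy logits, and the EOS index are hypothetical and chosen only for illustration; this is not the authors' analysis code.

```python
import math

def eos_probabilities(step_logits, eos_id):
    """Return the softmax probability of the EOS token at each decoding step."""
    probs = []
    for logits in step_logits:
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        probs.append(exps[eos_id] / sum(exps))
    return probs

# Toy logits over a 4-token vocabulary; index 3 plays the role of EOS (assumption).
toy_logits = [
    [2.0, 1.0, 0.5, -1.0],  # early step: EOS is unlikely
    [1.0, 1.0, 1.0, 1.0],   # uniform logits: every token equally likely
    [0.0, 0.0, 0.0, 3.0],   # late step: EOS dominates
]
eos_trace = eos_probabilities(toy_logits, eos_id=3)
```

With a real model, the per-step logits would come from the decoder's output scores, and a rising EOS probability along the sequence would indicate a growing tendency to stop.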

Zihao Yue, Liang Zhang, Qin Jin • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2019
Hallucination Evaluation	MMHal-Bench	MMHal Score: 2.33	306
Hallucination Evaluation	AMBER	--	222
Generative Hallucination	AMBER Generative	Coverage (%): 49.1	81
Generative Hallucination	Object-HalBench	CHAIR_S Score: 40.3	43
Hallucination assessment	Object-HalBench	Mention Hallucination Rate: 17.8	39
Object Hallucination Mitigation on Generative Tasks	AMBER	CHAIR: 5.1	38
Multimodal Capability Evaluation	MM-Star	Average Score: 32.9	36
Object Hallucination Detection	Object Hall-Bench	Res Score: 40.3	22
Open-ended generation	LLaVA-Bench	GPT-4 Score: 60.9	21

Showing 10 of 16 rows
