OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

About

Hallucination, a pervasive challenge for multi-modal large language models (MLLMs), has significantly impeded their use in real-world applications that demand precise judgment. Existing methods mitigate this issue either by training on specifically designed data or by drawing on external knowledge from other sources at inference time, both of which incur additional costs. In this paper, we present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy, serving as a nearly free lunch that alleviates hallucination without additional data, knowledge, or training. Our approach begins with an interesting observation: most hallucinations are closely tied to the knowledge aggregation patterns manifested in the self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a few summary tokens rather than on all previous tokens. This partial over-trust leads the model to neglect image tokens and to describe the image content with hallucinations. Based on this observation, OPERA introduces a penalty term on the model logits during beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens among the previously generated tokens and re-allocates the token selection if necessary. In extensive experiments, OPERA shows significant hallucination-mitigating performance across different MLLMs and metrics, demonstrating its effectiveness and generality. Our code is available at: https://github.com/shikiw/OPERA.
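The penalty idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a toy version under stated assumptions: each candidate token comes with a small causal attention window over recent tokens, a "summary token" shows up as a column that every later token attends to strongly, and the penalty is the largest column-wise product of scaled attention weights. The names `over_trust_penalty` and `rerank`, and the parameters `scale` and `alpha`, are hypothetical and chosen only for illustration. The rollback step is not covered here.

```python
from math import prod


def over_trust_penalty(attn_window, scale=2.0):
    """Score how strongly one column of a causal attention window dominates.

    attn_window is a k x k lower-triangular list of lists: row i holds the
    attention weights of token i over tokens 0..i in the window.
    Assumption: the penalty is the maximum column-wise product of the
    scaled weights, so a column that all later tokens attend to scores high.
    """
    k = len(attn_window)
    column_scores = []
    for j in range(k):
        # Column j: how strongly tokens j..k-1 attend to token j.
        column_scores.append(prod(scale * attn_window[i][j] for i in range(j, k)))
    return max(column_scores)


def rerank(candidates, alpha=1.0):
    """Order candidate tokens by log-probability minus the weighted penalty.

    candidates is a list of (token, log_prob, attn_window) tuples.
    Returns the tokens sorted best-first.
    """
    scored = [
        (log_prob - alpha * over_trust_penalty(window, scale=2.0), token)
        for token, log_prob, window in candidates
    ]
    return [token for _, token in sorted(scored, reverse=True)]
```

With `alpha` set to zero the ordering reduces to plain log-probability ranking, so in this sketch `alpha` controls how much the attention pattern is allowed to override the model's own preference.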

Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu • 2023

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 85.2	2019
Visual Question Answering	VizWiz	Accuracy 50.76	1820
Visual Question Answering	TextVQA	Accuracy 58.2	1453
Multimodal Understanding	MMBench	Accuracy 81.87	847
Science Question Answering	ScienceQA	Accuracy 90.33	791
Multimodal Evaluation	MME	--	727
Multimodal Understanding	MM-Vet	MM-Vet Score 66.9	631
Multimodal Reasoning	MM-Vet	MM-Vet Score 31.4	517
Multimodal Understanding	MMStar	Accuracy 64.53	407
Hallucination Evaluation	CHAIR	CHAIR_s 54.2	393

Showing 10 of 200 rows
