EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models

About

Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. However, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. To address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (EFUF), which can eliminate hallucinations without the need for paired data. Extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. Our code and datasets will be publicly available.

Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, Xinyu Dai • 2024

Related benchmarks

Task	Dataset	Result
Visual Question Answering	TextVQA	Accuracy 57.2	1453
Multimodal Capability Evaluation	MM-Vet	Score 31.2	393
Hallucination Evaluation	AMBER	CHAIR 5.8	222
Hallucination Evaluation	HallusionBench	--	153
Hallucination Evaluation	Object-HalBench	--	78
Object Hallucination Detection	MSCOCO	--	46
Visual Question Answering	VQA v2	Overall Accuracy 78.1	45
Text Generation	MSCOCO	BLEU-1 52.3	26
Science Question Answering	ScienceQA	Image Accuracy 66.4	26
Multi-modal Understanding	MM-Vet v1 (full)	Overall Score (MM-Vet v1) 31.2	16

Showing 10 of 14 rows
