
Stop learning it all to mitigate visual hallucination, Focus on the hallucination target

About

Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues, generating information about objects that are not present in input images during vision-language tasks. These hallucinations particularly undermine model reliability in practical applications requiring accurate object identification. To address this challenge, we propose \mymethod, a preference learning approach that mitigates hallucinations by focusing on the targeted areas where they occur. To implement this, we build a dataset containing hallucinated responses, correct responses, and target information (i.e., objects present in the images and the corresponding chunk positions in responses affected by hallucinations). By applying a preference learning method restricted to these specific targets, the model can filter out irrelevant signals and focus on correcting hallucinations. This allows the model to produce more factual responses by concentrating solely on relevant information. Experimental results demonstrate that \mymethod effectively reduces hallucinations across multiple vision hallucination tasks, improving the reliability and performance of MLLMs without diminishing overall performance.
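The targeted preference learning described above can be illustrated with a small sketch. This is not the paper's implementation: the function name, the exact masking scheme, and the pure-Python tensors are assumptions. It adapts a DPO-style objective so that the per-token log-probabilities of the chosen (correct) and rejected (hallucinated) responses are summed only over the annotated hallucination-target chunks, which is how irrelevant tokens are filtered out of the preference signal.

```python
import math

def targeted_preference_loss(policy_chosen, policy_rejected,
                             ref_chosen, ref_rejected,
                             chosen_mask, rejected_mask, beta=0.1):
    """DPO-style loss restricted to hallucination-target chunks (illustrative sketch).

    policy_*/ref_*: per-token log-probabilities of each response under the
    policy model and the frozen reference model.
    *_mask: 0/1 lists marking the chunk positions affected by hallucination
    (the "target information" in the dataset); all other tokens are ignored.
    """
    def masked_sum(logps, mask):
        # Sum log-probs only where the mask selects target tokens.
        return sum(lp * m for lp, m in zip(logps, mask))

    # Reward margin between chosen and rejected, relative to the reference,
    # computed over target chunks only.
    margin = (masked_sum(policy_chosen, chosen_mask)
              - masked_sum(ref_chosen, chosen_mask)) \
           - (masked_sum(policy_rejected, rejected_mask)
              - masked_sum(ref_rejected, rejected_mask))

    # Standard Bradley-Terry preference loss: -log sigmoid(beta * margin).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy already favors the correct response on the target tokens, the margin is positive and the loss falls below log 2; tokens outside the masks contribute nothing, so gradients concentrate on the hallucinated spans.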

Dokyoon Yoon, Youngsook Song, Woomyong Park • 2025

Related benchmarks

Task                        Dataset       Result                      Rank
Multimodal Reasoning        MM-Vet        MM-Vet Score: 32.2          281
Hallucination Evaluation    MMHal-Bench   MMHal Score: 2.72           174
Hallucination Evaluation    CHAIR         CHAIR_s: 20.1               166
Hallucination Evaluation    POPE          --                          132
Vision Understanding        MMBench       --                          104
Science Question Answering  SciQA-IMG     SciQA-IMG Accuracy: 70.3    53
Multimodal Evaluation       LLaVA-Bench   LLaVA-Bench Score: 71.2     38
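For context on the CHAIR row above, the CHAIR metrics score object hallucination in generated captions: CHAIR_s is the fraction of captions that mention at least one object absent from the image, and CHAIR_i is the fraction of all mentioned objects that are hallucinated. A minimal sketch, assuming object words have already been extracted from each caption and ground-truth object sets are available (the function name and input shapes are illustrative, not a fixed API):

```python
def chair_scores(mentioned_objects, gt_objects):
    """Compute CHAIR_s and CHAIR_i (illustrative sketch).

    mentioned_objects: per-caption list of object words extracted from the caption.
    gt_objects: per-image set of objects actually present in the image.
    """
    hallucinated_captions = 0
    total_mentions = 0
    hallucinated_mentions = 0
    for mentioned, present in zip(mentioned_objects, gt_objects):
        bad = [obj for obj in mentioned if obj not in present]
        if bad:  # caption hallucinates at least one object
            hallucinated_captions += 1
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
    chair_s = hallucinated_captions / len(mentioned_objects)
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    return chair_s, chair_i
```

Lower is better for both scores, which is why a reduced CHAIR_s indicates fewer hallucinated captions.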
