
FINER: MLLMs Hallucinate under Fine-grained Negative Queries

About

Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
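FINER-Tuning builds on Direct Preference Optimization. As a point of reference, here is a minimal sketch of the standard DPO objective on a single preference pair, assuming the per-response log-probabilities under the policy and a frozen reference model have already been computed (the function and argument names are illustrative, not from the FINER codebase):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The margin compares how much the policy prefers the chosen
    (e.g. non-hallucinated) answer over the rejected (hallucinated)
    one, relative to the frozen reference model.
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen answer, the loss decreases toward zero.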

Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz• 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Chart Question Answering | ChartQA | Accuracy | 86.8 | 356 |
| Multimodal Understanding | MMStar | Accuracy | 68.3 | 324 |
| Hallucination Evaluation | MMHal-Bench | MMHal Score | 4.7 | 216 |
| Hallucination Evaluation | POPE | Accuracy | 90.2 | 153 |
| Visual Search | V*Bench | Accuracy | 72.8 | 23 |
| Hallucination Evaluation | HaloQuest | Score S | 80.8 | 19 |
| Visual Pattern Recognition | MMVP | Accuracy | 78.7 | 19 |
| Hallucination Evaluation | FINER-CompreCap 1.0 (whole) | Multi-obj Acc (Paired) | 80 | 16 |
| Hallucination Evaluation | FINER-DOCCI 3K MCQs per setting 1.0 | Multi-Object Paired Accuracy | 65.9 | 16 |
| Compositional Reasoning | NaturalBench | Accuracy | 35.5 | 10 |

(Showing 10 of 15 rows)

Other info

GitHub
