# FINER: MLLMs Hallucinate under Fine-grained Negative Queries

## About
Multimodal large language models (MLLMs) are prone to hallucinations, particularly on fine-grained queries, a challenge underrepresented in existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with elements genuinely present in the image. To address this, we propose FINER-Tuning, which applies Direct Preference Optimization (DPO) to FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields gains of up to 24.2% (InternVL3.5-14B) on hallucinations from our benchmarks, while also improving performance on eight existing hallucination suites and enhancing general multimodal capabilities across six benchmarks. Code, benchmarks, and models are available at [https://explainableml.github.io/finer-project/](https://explainableml.github.io/finer-project/).
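FINER-Tuning builds on the standard DPO objective, which pushes the policy to assign a higher implicit reward to the preferred (non-hallucinated) response than to the dispreferred one. Below is a minimal, generic sketch of that loss for a single preference pair; the function and argument names are illustrative and do not correspond to the actual FINER-Tuning training code.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (illustrative sketch).

    Inputs are total log-probabilities of the chosen/rejected responses
    under the policy being trained and under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin; minimized when the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree, the margin is zero and the loss equals log 2; as the policy learns to favor the non-hallucinated response, the margin grows and the loss decreases.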
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Chart Question Answering | ChartQA | Accuracy | 86.8 | 356 |
| Multimodal Understanding | MMStar | Accuracy | 68.3 | 324 |
| Hallucination Evaluation | MMHal-Bench | MMHal Score | 4.7 | 216 |
| Hallucination Evaluation | POPE | Accuracy | 90.2 | 153 |
| Visual Search | V*Bench | Accuracy | 72.8 | 23 |
| Hallucination Evaluation | HaloQuest | Score S | 80.8 | 19 |
| Visual Pattern Recognition | MMVP | Accuracy | 78.7 | 19 |
| Hallucination Evaluation | FINER-CompreCap 1.0 (whole) | Multi-obj Acc (Paired) | 80 | 16 |
| Hallucination Evaluation | FINER-DOCCI 3K MCQs per setting 1.0 | Multi-Object Paired Accuracy | 65.9 | 16 |
| Compositional Reasoning | NaturalBench | Accuracy | 35.5 | 10 |