Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs
About
Medical Large Multi-modal Models (LMMs) have demonstrated remarkable capabilities in medical data interpretation. However, these models frequently generate hallucinations contradicting source evidence, particularly due to inadequate localization reasoning. This work reveals a critical limitation in current medical LMMs: instead of analyzing relevant pathological regions, they often rely on linguistic patterns or attend to irrelevant image areas when responding to disease-related queries. To address this, we introduce HEAL-MedVQA (Hallucination Evaluation via Localization MedVQA), a comprehensive benchmark designed to evaluate LMMs' localization abilities and hallucination robustness. HEAL-MedVQA features (i) two innovative evaluation protocols to assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA pairs, with doctor-annotated anatomical segmentation masks for pathological regions. To improve visual reasoning, we propose the Localize-before-Answer (LobA) framework, which trains LMMs to localize target regions of interest and self-prompt to emphasize segmented pathological areas, generating grounded and reliable answers. Experimental results demonstrate that our approach significantly outperforms state-of-the-art biomedical LMMs on the challenging HEAL-MedVQA benchmark, advancing robustness in medical VQA.
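The localize-then-answer idea described above can be sketched in a few lines. This is only an illustrative sketch, not the LobA implementation: the function names (`emphasize_region`, `localize_before_answer`), the dimming factor, and the toy localizer and answerer are all assumptions introduced here, and images are plain nested lists so the example has no dependencies. The abstract does not specify how the segmented area is emphasized; dimming everything outside the mask is one simple stand-in.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]   # grayscale pixels in [0, 1]
Mask = List[List[bool]]     # True where the region of interest lies


def emphasize_region(image: Image, mask: Mask, dim: float = 0.25) -> Image:
    """Keep masked pixels unchanged and scale all others by `dim` (assumed scheme)."""
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        [px if keep else px * dim for px, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def localize_before_answer(
    image: Image,
    question: str,
    localize: Callable[[Image, str], Mask],
    answer: Callable[[Image, str], str],
) -> Tuple[str, Mask]:
    """Step 1: localize the queried region. Step 2: answer on the emphasized image."""
    mask = localize(image, question)
    focused = emphasize_region(image, mask)
    return answer(focused, question), mask


# Hypothetical stand-ins for the model's two stages, for demonstration only.
def toy_localize(image: Image, question: str) -> Mask:
    return [[px > 0.5 for px in row] for row in image]


def toy_answer(image: Image, question: str) -> str:
    return "yes" if any(px > 0.5 for row in image for px in row) else "no"


image = [[0.1, 0.9], [0.2, 0.4]]
reply, mask = localize_before_answer(
    image, "Is there an opacity?", toy_localize, toy_answer
)
print(reply, mask)
```

Returning the mask alongside the answer keeps the localization step inspectable, which is what allows a benchmark to score where the model looked separately from what it said.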
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Open-ended Medical Visual Question Answering | MIMIC | F1 Score | 58.1 | 13 |
| Open-ended Medical Visual Question Answering | VinDr | F1 Score | 54.2 | 13 |
| Yes/No Medical Visual Question Answering | MIMIC | F1 Score | 75.2 | 13 |
| Yes/No Medical Visual Question Answering | VinDr | F1 Score | 72.8 | 13 |
| Medical Visual Question Answering | MIMIC Textual Perturbation (test) | Anatomy Accuracy | 79.2 | 9 |
| Medical Visual Question Answering | VinDr Textual Perturbation (test) | Anatomy Accuracy | 77 | 9 |
| Medical Visual Question Answering | MIMIC Visual Perturbation (test) | VPT Score | 73.4 | 9 |
| Medical Visual Question Answering | VinDr Visual Perturbation (test) | VPT Score | 70.1 | 9 |
| Binary Question Answering (Yes/No) | VinDr (subset) | F1 Score | 72.8 | 5 |
| Open-ended Question Answering | VinDr (subset) | F1 Score | 54.2 | 5 |
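
The yes/no rows above report an F1 score. The page does not say exactly how it is computed, so the following is a minimal sketch of the standard binary F1 definition (harmonic mean of precision and recall), assuming "yes" is the positive class. The function name `binary_f1` and the sample answers are made up for illustration and are not data from the benchmark.

```python
from typing import Sequence


def binary_f1(preds: Sequence[str], labels: Sequence[str], positive: str = "yes") -> float:
    """Standard F1 for one positive class; returns 0.0 when there are no true positives."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up model answers and reference answers, for demonstration only.
preds = ["yes", "yes", "no", "no"]
labels = ["yes", "no", "yes", "no"]
f1 = binary_f1(preds, labels)
print(f"F1 = {100 * f1:.1f}")
```

The table values appear to be on a 0 to 100 scale, so the sketch multiplies by 100 when printing.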