
Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

About

As vision-language models (VLMs) are increasingly deployed in clinical decision support, more than accuracy is required: knowing when to trust their predictions is equally critical. Yet a comprehensive, systematic investigation of overconfidence in these models remains notably scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B--38B), and multiple confidence-estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings. First, overconfidence persists across model families and is not resolved by scaling or by prompting strategies such as chain-of-thought and verbalized-confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling, reduce calibration error and consistently outperform prompt-based strategies. Third, because they are strictly monotone, these post-hoc calibration methods are inherently limited in improving the discriminative quality of predictions, leaving AUROC unchanged. Motivated by these findings, we investigate hallucination-aware calibration (HAC), which incorporates vision-grounded hallucination detection signals as complementary inputs to refine confidence estimates. We find that leveraging these hallucination signals improves both calibration and AUROC, with the largest gains on open-ended questions. Overall, our findings suggest post-hoc calibration as standard practice for medical VLM deployment over raw confidence estimates, and highlight the practical usefulness of hallucination signals for more reliable use of VLMs in medical VQA.
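The abstract's second and third findings can be illustrated in a few lines. The sketch below (not the paper's code; all data is synthetic) fits a Platt-style sigmoid map to overconfident scores, shows that expected calibration error (ECE) drops, and checks that AUROC is exactly preserved, since a strictly increasing map cannot change the ranking of predictions.

```python
# Sketch: Platt scaling lowers ECE but, being strictly monotone,
# leaves AUROC unchanged. Synthetic data; not the paper's implementation.
import numpy as np

def platt_scale(conf, correct, lr=0.1, steps=2000):
    """Fit sigmoid(a*conf + b) to correctness labels by gradient descent."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * conf + b)))
        grad = p - correct                      # d(log-loss)/d(logit)
        a -= lr * np.mean(grad * conf)
        b -= lr * np.mean(grad)
    return 1.0 / (1.0 + np.exp(-(a * conf + b)))

def ece(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            err += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return err

def auroc(score, label):
    """Rank-based AUROC: probability a correct answer outranks a wrong one."""
    order = np.argsort(score)
    rank = np.empty_like(order, dtype=float)
    rank[order] = np.arange(1, len(score) + 1)
    pos = label == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (rank[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
correct = rng.binomial(1, 0.6, size=2000).astype(float)
# Overconfident raw scores: high even when the answer is wrong.
raw = np.clip(0.7 + 0.25 * correct + 0.1 * rng.normal(size=2000), 0, 1)
cal = platt_scale(raw, correct)
# ece(cal, correct) < ece(raw, correct); auroc is identical for raw and cal.
```

The AUROC invariance is the paper's third point: any monotone recalibration only relabels confidence values, so discriminating correct from incorrect answers requires an extra signal, which is what HAC's hallucination-detection inputs provide.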

Ji Young Byun, Young-Jin Park, Jean-Philippe Corbeil, Asma Ben Abacha• 2026

Related benchmarks

Task                              | Dataset                                 | Metric | Result | Rank
Medical Visual Question Answering | VQA-Med Open                            | ECE    | 0.006  | 192
Medical Visual Question Answering | VQA-Med Closed                          | ECE    | 0.013  | 96
Medical Visual Question Answering | SLAKE Closed                            | ACE    | 36.7   | 96
Medical Visual Question Answering | SLAKE Open                              | ACE    | 45.2   | 96
Medical Visual Question Answering | VQA-RAD Closed                          | ECE    | 1.3    | 96
Visual Question Answering         | VQA-Med Closed                          | AUROC  | 81.8   | 96
Visual Question Answering         | VQA-Med Open                            | AUROC  | 0.825  | 96
Visual Question Answering         | VQA-RAD Closed                          | AUROC  | 70.2   | 96
Visual Question Answering         | VQA-RAD Open                            | AUROC  | 0.819  | 96
Medical Visual Question Answering | Pooled Medical VQA Datasets Open 5-fold CV | ACE | 0.075 | 96

(Showing 10 of 21 rows.)
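The table reports two calibration metrics: ECE bins predictions by equal-width confidence intervals, while ACE (adaptive calibration error) uses equal-mass bins so every bin holds the same number of samples. A minimal sketch of the contrast, on synthetic overconfident scores (not the paper's evaluation code):

```python
# Sketch: ECE (equal-width bins) vs ACE (equal-mass bins).
# Synthetic overconfident scores; not the paper's evaluation code.
import numpy as np

def ece(conf, correct, n_bins=10):
    """Equal-width bins: |accuracy - confidence| gap, weighted by bin size."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            err += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return err

def ace(conf, correct, n_bins=10):
    """Equal-mass bins: sort by confidence, split into chunks of equal size."""
    order = np.argsort(conf)
    err = 0.0
    for chunk in np.array_split(order, n_bins):
        err += abs(conf[chunk].mean() - correct[chunk].mean()) / n_bins
    return err

rng = np.random.default_rng(0)
correct = rng.binomial(1, 0.6, size=2000).astype(float)
conf = np.clip(0.7 + 0.25 * correct + 0.1 * rng.normal(size=2000), 0, 1)
# Both metrics expose the large confidence/accuracy gap in these scores.
```

Both metrics live on a 0-1 scale when computed this way; rows above mixing values like 0.006 and 36.7 presumably report the same quantities on fraction vs. percentage scales, as listed by the benchmark pages.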
