Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts

About

Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases in free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely because they rely solely on latent embeddings and have no explicit localization priors. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention mechanism that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.

Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang, Huazhu Fu • 2026
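
The abstract reports gains in AP50 and mIoU but does not spell out how either is computed. As a rough illustration only, the sketch below computes box IoU, its mean over a set of predictions, and the fraction of predictions whose IoU reaches 0.5. Treating AP50 as this simple hit rate is an assumption (a reading often used in grounding work), not the paper's stated protocol, and the names `box_iou`, `mean_iou`, and `hit_rate_at` are hypothetical helpers, not part of KnowMVG.

```python
from typing import Sequence, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    ious = [box_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious) if ious else 0.0


def hit_rate_at(preds: Sequence[Box], targets: Sequence[Box],
                threshold: float = 0.5) -> float:
    """Fraction of pairs with IoU >= threshold (a simplified stand-in for AP50)."""
    ious = [box_iou(p, t) for p, t in zip(preds, targets)]
    return sum(iou >= threshold for iou in ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy boxes for illustration only; not data from the paper.
    preds = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 1.0, 1.0)]
    targets = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
    print(mean_iou(preds, targets), hit_rate_at(preds, targets))
```

A benchmark that defines AP50 as a ranked average precision at IoU 0.5 would need a different computation; the hit rate above only matches protocols that score one predicted box per phrase.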

Related benchmarks

Task	Dataset	Result
Medical Report Grounding	MRG-MS-CXR	AP10 91.02	10
Medical Report Grounding	MRG-CHESTX-RAY8	AP@10 86.87	10
Medical Visual Grounding	MRG-MIMIC-VQA	AP@10 90.51	5
Medical Visual Grounding	MRG-MIMIC-CLASS	AP@10 91.77	5

