
LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray

About

Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.

Myeongkyun Kang, Yanting Yang, Xiaoxiao Li • 2026
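The abstract states that LoFi jointly optimizes sigmoid, captioning, and location-aware captioning losses, but it does not give their exact form or weighting. The sketch below is one plausible reading, not the authors' implementation: it assumes the sigmoid term is a SigLIP-style pairwise loss over image-text similarities, that both captioning terms are mean token negative log-likelihoods, and that the three terms are combined as a weighted sum. The function names, the `temperature` and `bias` defaults, and the unit weights are all illustrative assumptions.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pair_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalised embeddings.

    Assumption: pair (i, i) is a matching image-report pair and every
    (i, j) with i != j is a negative. temperature and bias are
    illustrative defaults, not values taken from the paper.
    """
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(img_emb[i], txt_emb[j]))
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * (temperature * sim + bias))
    return total / n


def token_nll(target_token_probs):
    """Mean negative log-likelihood of the target tokens.

    Assumption: used here for both the plain captioning term and the
    location-aware captioning term (grounding / dense captioning targets).
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def joint_loss(l_sig, l_cap, l_loc, w_sig=1.0, w_cap=1.0, w_loc=1.0):
    # Hypothetical weighted sum; the source does not specify the weights.
    return w_sig * l_sig + w_cap * l_cap + w_loc * l_loc


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    reports = [[1.0, 0.0], [0.0, 1.0]]
    l_sig = sigmoid_pair_loss(images, reports)
    l_cap = token_nll([0.9, 0.8, 0.7])   # toy report-token probabilities
    l_loc = token_nll([0.6, 0.5])        # toy box/region-token probabilities
    print(joint_loss(l_sig, l_cap, l_loc))
```

In this reading, the location-aware term differs from the plain captioning term only in its targets (region coordinates or region-conditioned text), which is how a captioning objective could supply region-level supervision.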

Related benchmarks

Task                      Dataset                      Metric  Result  Rank
Image-to-Text Retrieval   MIMIC-CXR (test)             R@1     13.9    20
Text-to-Image Retrieval   MIMIC-CXR (test)             R@1     11.91   12
Phrase grounding          PadChest-GR (external val)   Ro/L    63.55   6
Phrase grounding          PadChest-GR (internal val)   Ro/L    70.42   5
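For readers unfamiliar with the retrieval metric, R@1 is the percentage of queries whose top-ranked candidate is a correct match. The following is a minimal sketch of Recall@k, assuming a square similarity matrix in which the correct candidate for query i is candidate i. That one-to-one pairing is a simplifying assumption, and the benchmark's actual evaluation protocol may differ.

```python
def recall_at_k(similarity, k=1):
    """Recall@k in percent.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the single correct candidate for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring candidates for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)


if __name__ == "__main__":
    # Toy scores: rows are image queries, columns are candidate reports.
    toy = [
        [0.9, 0.1, 0.0],
        [0.2, 0.3, 0.8],
        [0.1, 0.2, 0.7],
    ]
    print(recall_at_k(toy, k=1), recall_at_k(toy, k=2))
```

Transposing the matrix turns the same function from image-to-text into text-to-image retrieval.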