Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays
About
Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, and radiologists' gaze, a crucial cue for visual reasoning, remains largely underexplored. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context (including patient history, symptoms, and diagnostic intent) to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbert F1 and +1.2% BLEU-2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.
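To make the idea of gaze as a probabilistic prior more concrete, the following is a minimal sketch of one plausible formulation, not the CoGaze implementation. It assumes that a gaze heatmap over image patches is normalized into a probability distribution and that the model's patch attention is pulled toward it with a KL divergence term. The function names (`normalize`, `softmax`, `gaze_kl_loss`), the flattened patch layout, and the choice of KL divergence are all illustrative assumptions; the abstract does not specify the actual loss.

```python
import math


def normalize(heatmap):
    """Turn a non-negative gaze heatmap (flattened over patches) into a distribution."""
    total = sum(heatmap)
    n = len(heatmap)
    if total <= 0:
        # Assumption: with no recorded fixations, fall back to a uniform prior.
        return [1.0 / n] * n
    return [h / total for h in heatmap]


def softmax(scores):
    """Numerically stable softmax over per-patch attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gaze_kl_loss(attn_logits, gaze_heatmap, eps=1e-8):
    """KL(gaze || attention): small when attention mass follows the gaze prior.

    Hypothetical loss for illustration; patches with zero gaze mass contribute nothing.
    """
    p = normalize(gaze_heatmap)  # gaze prior over patches
    q = softmax(attn_logits)     # model attention over the same patches
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Usage: four patches, with gaze concentrated on the second and third.
gaze = [0.0, 1.0, 3.0, 0.0]
uniform_attention = [0.0, 0.0, 0.0, 0.0]
focused_attention = [-20.0, math.log(0.25), math.log(0.75), -20.0]
print(gaze_kl_loss(uniform_attention, gaze))  # larger: attention ignores the gaze prior
print(gaze_kl_loss(focused_attention, gaze))  # near zero: attention matches the gaze prior
```

Under these assumptions, such a term would be added to the contrastive and disease-aware objectives with a weighting coefficient; treating gaze as a soft distribution rather than a hard mask lets the model tolerate noisy or incomplete fixation data.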
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Radiology Report Generation | MIMIC-CXR (test) | BLEU-4 0.175 | 172 |
| Classification | SIIM | AUC 97.4 | 56 |
| Chest X-ray classification | NIH (test) | AUROC 86.1 | 47 |
| Classification | RSNA (test) | F1 Score 84.8 | 44 |
| Image Classification | SIIM (test) | F1 Score 97.4 | 30 |
| Lesion Segmentation | RSNA 56 | Dice Score 80.22 | 12 |
| Lesion Segmentation | TBX11K 42 | Dice 96.56 | 12 |
| Classification | Shenzhen 21 (test) | F1 Score 81.3 | 9 |
| Classification | RSNA 56 (test) | F1 Score 77 | 9 |
| Structured report generation | SRRG-Findings (test) | BLEU3 | 4 |