CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training
About
Large-scale image-text pair datasets have greatly contributed to the development of vision-language pre-training (VLP) models, which enable zero-shot or few-shot classification without costly annotation. In the medical domain, however, data scarcity remains a significant obstacle to building a powerful VLP model. In this paper, we tackle the lack of image-text data for chest X-rays by expanding image-label pairs into image-text pairs via general prompts and by utilizing the multiple images and multiple report sections available in a radiologic study. We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports, respectively. Our model outperforms state-of-the-art models trained under the same conditions. Moreover, the enlarged dataset improves the discriminative power of our pre-trained model for classification at the cost of only a marginal drop in retrieval performance. Code is available at https://github.com/kakaobrain/cxr-clip.
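The ICL and TCL losses described above are contrastive objectives in the CLIP family. While the exact formulations are defined in the paper and repository, the general form such losses take is a symmetric InfoNCE loss over matched image/text embedding pairs. The sketch below is illustrative only (NumPy, with a hypothetical `info_nce` helper and an assumed temperature of 0.07), not the authors' implementation:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) embeddings.

    Illustrative sketch of a CLIP-style contrastive objective; the actual
    ICL/TCL losses are defined in the CXR-CLIP paper and repository.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N); matched pairs lie on the diagonal
    n = len(img)

    def xent(l):
        # cross-entropy with the diagonal as the target class, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average of image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Pulling matched pairs together on the diagonal while pushing apart all off-diagonal pairs is what lets the pre-trained encoders perform zero-shot classification and retrieval.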
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Classification | SIIM | AUC | 94 | 54 |
| Thoracic Disease Classification | MIMIC-CXR (test) | Atelectasis AUC | 50 | 28 |
| Classification | VinDR-CXR | AUC | 0.89 | 24 |
| Classification | RSNA | AUC | 89.8 | 24 |
| Image-to-Text Retrieval | Open-i | -- | -- | 17 |
| Image-to-Text Retrieval | CheXpert 5x200 | R@1 | 9.4 | 13 |
| Image-to-Text Retrieval | MIMIC-CXR | R@1 | 21.6 | 13 |
| Image Classification | MIMIC 5x200 (test) | Accuracy | 49.7 | 9 |
| Text-Image Retrieval | MIMIC-CXR 5x200 | mAP@1 | 60.2 | 9 |
| Image-Text Retrieval | MIMIC-CXR 5x200 | mAP@1 | 51.8 | 9 |