Generalized Radiograph Representation Learning via Cross-supervision between Images and Free-text Radiology Reports
About
Pre-training lays the foundation for recent successes in deep-learning-based radiograph analysis: it learns transferable image representations through large-scale fully supervised or self-supervised learning on a source domain. However, supervised pre-training requires a complex and labor-intensive two-stage human-assisted annotation process, while self-supervised learning cannot yet compete with the supervised paradigm. To tackle these issues, we propose a cross-supervised methodology named REviewing FreE-text Reports for Supervision (REFERS), which acquires free supervision signals from the original radiology reports accompanying the radiographs. The approach employs a vision transformer and is designed to learn joint representations from the multiple views within each patient study. REFERS outperforms its transfer-learning and self-supervised-learning counterparts on four well-known X-ray datasets under extremely limited supervision, and it even surpasses methods pre-trained on a source domain of radiographs with human-assisted structured labels. REFERS thus has the potential to replace canonical pre-training methodologies.
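To make the cross-supervision idea concrete, below is a minimal, illustrative PyTorch sketch, not the authors' released implementation: a small vision-transformer stand-in encodes each view of a patient study, an attention layer fuses the per-view features into a single study representation, and a symmetric contrastive loss aligns that representation with a precomputed embedding of the free-text report. All names (`TinyViT`, `CrossSupervisedStudyEncoder`, `study_report_contrastive_loss`), the dimensions, and the use of a single contrastive objective in place of the paper's full set of report-related pre-training tasks are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViT(nn.Module):
    """Minimal vision-transformer stand-in: patch embedding, a [CLS]
    token, and a small transformer encoder (illustrative only)."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                       # x: (B, 1, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.encoder(tokens)[:, 0]       # [CLS] feature, (B, dim)

class CrossSupervisedStudyEncoder(nn.Module):
    """Encodes every view in a patient study with a shared ViT, fuses
    the per-view features by attention, and projects the study
    representation into the report-embedding space."""
    def __init__(self, dim=256, proj_dim=128, txt_dim=768):
        super().__init__()
        self.vit = TinyViT(dim=dim)
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.img_proj = nn.Linear(dim, proj_dim)
        # txt_dim assumes BERT-sized pooled report features (assumption).
        self.txt_proj = nn.Linear(txt_dim, proj_dim)

    def forward(self, views, report_emb):
        # views: (B, V, 1, H, W); report_emb: (B, txt_dim) pooled report feature
        B, V = views.shape[:2]
        feats = self.vit(views.flatten(0, 1)).view(B, V, -1)  # (B, V, dim)
        q = self.query.expand(B, -1, -1)
        study, _ = self.fuse(q, feats, feats)                 # (B, 1, dim)
        z_img = F.normalize(self.img_proj(study.squeeze(1)), dim=-1)
        z_txt = F.normalize(self.txt_proj(report_emb), dim=-1)
        return z_img, z_txt

def study_report_contrastive_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE: each study should match its own report."""
    logits = z_img @ z_txt.t() / temperature
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy forward pass: 2 studies, 2 views each, with random "report" embeddings.
model = CrossSupervisedStudyEncoder()
views = torch.randn(2, 2, 1, 224, 224)
reports = torch.randn(2, 768)
loss = study_report_contrastive_loss(*model(views, reports))
```

Because the ViT weights are shared across views and fusion happens before the loss, the supervision signal from a single report reaches every radiograph in the study, which is what allows free-text reports to stand in for per-image structured labels.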
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image-omic Classification | TCGA Lung Cancer | Accuracy: 93.68 | 14 |
| Text-Image Retrieval | MIMIC-CXR 5x200 | mAP@1: 60.6 | 9 |
| Image Classification | MIMIC 5x200 (test) | Accuracy: 49.5 | 9 |
| Image-Text Retrieval | MIMIC-CXR 5x200 | mAP@1: 52.4 | 9 |
| Image Classification | CheXpert 5x200 (test) | Accuracy: 41.8 | 9 |