Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset
About
The large volume of abdominal computed tomography (CT) scans, coupled with the shortage of radiologists, has intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision-language models (VLMs) that jointly model images and radiology reports. However, current medical VLMs are generally limited to 2D images and short reports.

Here, to overcome these shortcomings for abdominal CT interpretation, we introduce Merlin, a 3D VLM that learns from volumetric CT scans, electronic health record data, and radiology reports. This approach is enabled by a multistage pretraining framework that requires no additional manual annotations. We trained Merlin on a high-quality clinical dataset of paired CT scans (>6 million images from 15,331 CT scans), diagnosis codes (>1.8 million codes), and radiology reports (>6 million tokens).

We comprehensively evaluated Merlin on 6 task types and 752 individual tasks covering diagnostic, prognostic, and quality-related applications. The non-adapted (off-the-shelf) tasks included zero-shot classification of findings (30 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image-to-findings and image-to-impression). The model-adapted tasks included 5-year chronic disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We validated Merlin at scale, with internal testing on 5,137 CT scans and external testing on 44,098 CT scans from 3 independent sites and 2 public datasets. The results demonstrated strong generalization across institutions and anatomies, and Merlin outperformed 2D VLMs, CT foundation models, and off-the-shelf radiology models. We release our trained models, code, and dataset at: https://github.com/StanfordMIMI/Merlin.
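The zero-shot classification described above follows the standard contrastive-VLM recipe: embed the CT volume with the image encoder, embed a text prompt per candidate finding with the text encoder, and score findings by similarity in the shared embedding space. The sketch below illustrates that scoring step with toy NumPy vectors; the embeddings, prompt wording, and function name are illustrative placeholders, not Merlin's actual API (see the GitHub repository for the real interface).

```python
import numpy as np

def zero_shot_scores(image_emb, prompt_embs):
    """Rank candidate findings by cosine similarity between one image
    embedding and a stack of text-prompt embeddings (contrastive VLM style)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = txt @ img                       # one cosine similarity per finding
    exp = np.exp(sims - sims.max())        # numerically stable softmax
    return exp / exp.sum()                 # probabilities over the findings

# Toy embeddings standing in for encoder outputs (hypothetical dimensions).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=128)           # embedding of one CT volume
prompt_embs = rng.normal(size=(3, 128))    # e.g. prompts for 3 candidate findings
probs = zero_shot_scores(image_emb, prompt_embs)
```

Because the scoring is just a dot product, adding a new finding at inference time only requires writing a new text prompt, which is why no task-specific fine-tuning is needed for the off-the-shelf tasks.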
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Alzheimer stage classification | ADNI | AUC 89.7 | 116 |
| Segmentation | LiTS | Dice score 55.1 | 45 |
| Segmentation | ACDC | DSC 74.9 | 41 |
| Classification | CT-RATE | AUC 0.798 | 29 |
| Radiology report generation | Radiology Reports | BLEU 26.9 | 28 |
| AD diagnosis | ADNI (test) | -- | 28 |
| Classification | Kidney Trauma 27 (test) | AUC 65.4 | 27 |
| Classification | Liver Trauma 27 (test) | AUC 66.4 | 27 |
| Classification | Spleen Trauma 27 (test) | AUC 70.3 | 27 |
| Classification | RSNA ICH 19 (test) | AUC 65.5 | 27 |