M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization
About
Medical vision-language models enable co-learning and integration of features from medical imaging and clinical text. However, these models are challenging to train, and their latent representation spaces can be complex. Here we propose a novel way to pre-train and regularise medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency, and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches while reducing the number of parameters by 78%. Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned on 100% of the data.
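The orthogonality loss is only named above, not specified. A minimal sketch of one plausible form, assuming it penalizes the deviation of the latent features' Gram matrix from the identity (so that feature dimensions become decorrelated), might look like this; the function name and exact formulation are illustrative, not the paper's definition:

```python
import numpy as np

def orthogonality_loss(z: np.ndarray) -> float:
    """Penalize correlation between latent feature dimensions.

    z: (batch, dim) array of latent features. Each feature column is
    L2-normalized; the loss is the squared Frobenius distance between
    the resulting Gram matrix and the identity, so it is zero exactly
    when the feature dimensions are mutually orthogonal.
    """
    z = z / np.linalg.norm(z, axis=0, keepdims=True)  # normalize each column
    gram = z.T @ z                                    # (dim, dim) correlation matrix
    return float(np.sum((gram - np.eye(z.shape[1])) ** 2))
```

Orthonormal features (e.g. an identity batch) give zero loss, while duplicated feature columns are maximally penalized, which is the geometric regularization effect the abstract describes.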
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | RSNA | mAP (%) | 25.4 | 106 |
| Semantic Segmentation | SIIM | Dice (%) | 64.8 | 96 |
| Semantic Segmentation | RSNA | Dice (%) | 70.5 | 90 |
| Object Detection | Object-CXR | mAP (%) | 19.5 | 58 |
| Chest X-ray Classification | NIH (test) | AUROC (%) | 84.0 | 47 |
| Classification | RSNA (test) | F1 Score (%) | 64.4 | 44 |
| Linear Classification | RSNA (test) | AUC (%) | 90.5 | 39 |
| Linear Classification | COVIDx (test) | Accuracy (%) | 90.7 | 39 |
| Linear Classification | CheXpert (test) | AUC | 0.886 | 39 |
| Image Classification | SIIM (test) | F1 Score (%) | 72.1 | 30 |