M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization

About

Medical vision-language models enable co-learning and integrating features from medical imaging and clinical text. However, these models are not easy to train and the latent representation space can be complex. Here we propose a novel way for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches and reduces the number of parameters by 78\%. Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1\% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned using 100\% of the data.

Che Liu, Sibo Cheng, Chen Chen, Mengyun Qiao, Weitong Zhang, Anand Shah, Wenjia Bai, Rossella Arcucci• 2023

Related benchmarks

Task	Dataset	Result
Object Detection	RSNA	mAP (%)25.4	106
Semantic segmentation	SIIM	Dice Coefficient (%)64.8	96
Semantic segmentation	RSNA	Dice Score70.5	90
Object Detection	Object-CXR	mAP19.5	58
Chest X-ray classification	NIH (test)	AUROC84	47
Classification	RSNA (test)	F1 Score64.4	44
Linear Classification	RSNA (test)	AUC90.5	39
Linear Classification	COVIDx (test)	Accuracy90.7	39
Linear Classification	CheXpert (test)	AUC0.886	39
Image Classification	SIIM (test)	F1 Score72.1	30

Showing 10 of 23 rows

Other info

Code

Follow for update

@wizwand_team Discord