MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

About

Existing vision-text contrastive learning methods such as CLIP match paired image and caption embeddings while pushing unpaired ones apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general image-caption datasets collected from the internet. Moreover, previous methods encounter many false negatives: images and reports from different patients may carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, which scales the usable training data combinatorially at low cost. We also propose replacing the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP outperforms the state-of-the-art method (which uses around 200K data). Our code is available at https://github.com/RyanWangZf/MedCLIP.

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun • 2022
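
The abstract's idea of replacing InfoNCE with a semantic matching loss can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes each image and each text comes with a label vector (here called `img_labels` and `txt_labels`, hypothetical names), builds soft targets from the cosine similarity of those label vectors, and uses them in place of the one-hot targets that InfoNCE would use. The function name, the temperature value, and the way soft targets are normalized are all assumptions made for illustration. Because targets come from labels rather than pairing, the number of images and texts need not match.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels,
                           temperature=0.07):
    """Cross-entropy between predicted image-text similarities and soft
    targets derived from label similarity (illustrative sketch only).

    img_emb:    (N, d) image embeddings
    txt_emb:    (M, d) text embeddings
    img_labels: (N, k) label vectors for images (assumed input)
    txt_labels: (M, k) label vectors for texts (assumed input)
    """
    # Predicted similarity logits between every image and every text.
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature

    # Semantic similarity from labels; semantically matching pairs from
    # different patients get high target weight instead of being negatives.
    sem = l2_normalize(img_labels) @ l2_normalize(txt_labels).T

    # Soft targets in each direction (assumption: plain softmax over rows).
    tgt_i2t = np.exp(log_softmax(sem, axis=1))
    tgt_t2i = np.exp(log_softmax(sem.T, axis=1))

    # Cross-entropy in the image-to-text and text-to-image directions.
    loss_i2t = -(tgt_i2t * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2i = -(tgt_t2i * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random stand-in data: 4 images and 6 texts, not paired.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 16))
txt_emb = rng.normal(size=(6, 16))
img_labels = rng.integers(0, 2, size=(4, 5)).astype(float)
txt_labels = rng.integers(0, 2, size=(6, 5)).astype(float)
loss = semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels)
print(float(loss))
```

With one-hot targets on the diagonal of a square similarity matrix, the same cross-entropy reduces to the usual symmetric InfoNCE objective, which is why soft label-based targets can be read as a drop-in replacement.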

Related benchmarks

Task	Dataset	Result
Anomaly Localization	MVTec AD	Pixel AUROC 79.1	534
Semantic segmentation	SIIM	Dice Coefficient (%) 56.19	96
Medical Image Classification	COVID	Accuracy 77.7	91
Multi-Label Classification	ChestX-Ray14 (test)	AUROC (%) 56.4	88
Image-level Anomaly Detection	MVTec AD	AUROC 75.2	82
Image-level Anomaly Detection	VisA	AUC 69	80
Image Classification	BreastMNIST	Accuracy 78.76	74
Anomaly Segmentation	RESC	AUC 91.5	74
Anomaly Classification	LiverCT	AUC 60.65	72
Anomaly Classification	RESC	AUC (%) 66.58	68

Showing 10 of 153 rows
