Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training

About

The surge in multimodal AI's success has sparked concerns over data privacy in vision-and-language tasks. While CLIP has revolutionized multimodal learning through joint training on images and text, its potential to unintentionally disclose sensitive information necessitates the integration of privacy-preserving mechanisms. We introduce a differentially private adaptation of the Contrastive Language-Image Pretraining (CLIP) model that effectively addresses privacy concerns while retaining accuracy. Our proposed method, Dp-CLIP, is rigorously evaluated on benchmark datasets encompassing diverse vision-and-language tasks such as image classification and visual question answering. We demonstrate that our approach retains performance on par with the standard non-private CLIP model. Furthermore, we analyze our proposed algorithm under linear representation settings. We derive the convergence rate of our algorithm and show a trade-off between utility and privacy when gradients are clipped per-batch and the loss function does not satisfy smoothness conditions assumed in the literature for the analysis of DP-SGD.
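
To make the per-batch clipping mentioned above concrete, the following is a minimal sketch of one noisy update step in which the gradient of the whole batch is clipped as a single vector before Gaussian noise is added. It is an illustration under stated assumptions, not the authors' implementation: the function names `clip_by_norm` and `noisy_clipped_step`, the plain-list parameter representation, and the free parameter `noise_std` are all hypothetical, and calibrating the noise to a specific privacy budget is outside its scope.

```python
import math
import random


def clip_by_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_clipped_step(params, batch_grad, clip_norm, noise_std, lr, rng):
    """One update: clip the whole-batch gradient, add Gaussian noise, descend.

    batch_grad is the gradient of the batch loss (one vector per batch,
    not one per example). noise_std is an assumed free parameter here.
    """
    clipped = clip_by_norm(batch_grad, clip_norm)
    noisy = [g + rng.gauss(0.0, noise_std) for g in clipped]
    return [p - lr * g for p, g in zip(params, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    params = [1.0, 1.0]
    batch_grad = [3.0, 4.0]  # L2 norm 5, so it is scaled down to norm 1
    print(noisy_clipped_step(params, batch_grad, clip_norm=1.0,
                             noise_std=0.1, lr=0.5, rng=rng))
```

Clipping bounds how much any one batch can move the parameters, and the noise masks what remains; a larger `noise_std` or a smaller `clip_norm` pushes toward privacy at the cost of utility, which is the trade-off the abstract refers to.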

Alyssa Huang, Peihan Liu, Ryumei Nakada, Linjun Zhang, Wanrong Zhang • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Classification | CIFAR10 (test) | Accuracy | 35.1 | 331 |
| Image Classification | F-MNIST (test) | Accuracy | 81.1 | 156 |
| Text-to-Image Retrieval | CUHK-PEDES (test) | Recall@1 | 13.3 | 114 |
| Classification | EuroSAT | Top-1 Accuracy | 50 | 26 |
| Classification | EuroSAT (test) | Top-1 Acc | 52.3 | 24 |
| Image-to-Text Retrieval | CUHK-PEDES (test) | -- | -- | 24 |
| Classification | CAMELYON | Accuracy | 70.6 | 20 |
| Classification | CAMELYON (test) | Accuracy | 73.5 | 20 |
| Text-to-image person retrieval | RSTPReid (test) | -- | -- | 17 |
| Image-to-Text Retrieval | RSTPReid (test) | Retrieval Accuracy | 17.2 | 10 |
Showing 10 of 14 rows
