Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training
About
The surge in multimodal AI's success has sparked concerns over data privacy in vision-and-language tasks. While CLIP has revolutionized multimodal learning through joint training on images and text, its potential to unintentionally disclose sensitive information necessitates the integration of privacy-preserving mechanisms. We introduce a differentially private adaptation of the Contrastive Language-Image Pretraining (CLIP) model that effectively addresses privacy concerns while retaining accuracy. Our proposed method, Dp-CLIP, is rigorously evaluated on benchmark datasets encompassing diverse vision-and-language tasks such as image classification and visual question answering. We demonstrate that our approach retains performance on par with the standard non-private CLIP model. Furthermore, we analyze our proposed algorithm under linear representation settings. We derive the convergence rate of our algorithm and show a trade-off between utility and privacy when gradients are clipped per-batch and the loss function does not satisfy smoothness conditions assumed in the literature for the analysis of DP-SGD.
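To make the per-batch clipping idea concrete, the following is a minimal NumPy sketch of one noisy update on a CLIP-style contrastive loss with linear encoders. It is an illustration under stated assumptions, not the authors' implementation: the function names (`clip_loss_and_grads`, `clip_batch_gradient`, `dp_step`), the temperature `tau`, the clipping threshold `clip_c`, and the noise multiplier `sigma` are all hypothetical, and the noise scale is not calibrated to any particular privacy budget. Because every term of the contrastive loss depends on the whole batch, the sketch clips the norm of the batch gradient as a whole rather than clipping per-example gradients.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss_and_grads(G1, G2, X, Y, tau=1.0):
    """Symmetric contrastive loss for linear encoders (assumed setup).

    X: (n, d1) image features, Y: (n, d2) text features.
    G1: (r, d1) and G2: (r, d2) are the linear representation maps.
    Returns the loss and its gradients with respect to G1 and G2.
    """
    n = X.shape[0]
    U, V = X @ G1.T, Y @ G2.T            # embeddings, each (n, r)
    S = U @ V.T / tau                    # pairwise similarity logits, (n, n)
    P_row, P_col = softmax(S, axis=1), softmax(S, axis=0)
    idx = np.arange(n)
    loss = -0.5 * (np.log(P_row[idx, idx]).mean()
                   + np.log(P_col[idx, idx]).mean())
    # Gradient of the loss with respect to the logits S.
    dS = 0.5 * ((P_row - np.eye(n)) + (P_col - np.eye(n))) / n
    dU, dV = dS @ V / tau, dS.T @ U / tau
    return loss, dU.T @ X, dV.T @ Y

def clip_batch_gradient(g1, g2, clip_c):
    # Rescale the whole batch gradient so its joint L2 norm is at most clip_c.
    norm = np.sqrt((g1 ** 2).sum() + (g2 ** 2).sum())
    scale = min(1.0, clip_c / (norm + 1e-12))
    return g1 * scale, g2 * scale

def dp_step(G1, G2, X, Y, lr=0.1, clip_c=1.0, sigma=1.0, rng=None):
    # One noisy gradient step: clip per batch, add Gaussian noise, update.
    rng = np.random.default_rng() if rng is None else rng
    _, g1, g2 = clip_loss_and_grads(G1, G2, X, Y)
    g1, g2 = clip_batch_gradient(g1, g2, clip_c)
    g1 = g1 + sigma * clip_c * rng.standard_normal(g1.shape)  # assumed noise scale
    g2 = g2 + sigma * clip_c * rng.standard_normal(g2.shape)
    return G1 - lr * g1, G2 - lr * g2
```

Larger `sigma` or smaller `clip_c` in this sketch corresponds to stronger perturbation of each update, which is one way to see the utility-privacy trade-off the abstract refers to. A full treatment would also need privacy accounting across steps, which is omitted here.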
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Classification | CIFAR10 (test) | Accuracy | 35.1 | 331 |
| Image Classification | F-MNIST (test) | Accuracy | 81.1 | 156 |
| Text-to-Image Retrieval | CUHK-PEDES (test) | Recall@1 | 13.3 | 114 |
| Classification | EuroSAT | Top-1 Accuracy | 50 | 26 |
| Classification | EuroSAT (test) | Top-1 Accuracy | 52.3 | 24 |
| Image-to-Text Retrieval | CUHK-PEDES (test) | -- | -- | 24 |
| Classification | CAMELYON | Accuracy | 70.6 | 20 |
| Classification | CAMELYON (test) | Accuracy | 73.5 | 20 |
| Text-to-Image Person Retrieval | RSTPReid (test) | -- | -- | 17 |
| Image-to-Text Retrieval | RSTPReid (test) | Retrieval Accuracy | 17.2 | 10 |