Raising the Bar of AI-generated Image Detection with CLIP
About
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, contrary to previous beliefs, it is neither necessary nor convenient to use a large domain-specific dataset for training. Instead, using only a handful of example images from a single generative model, a CLIP-based detector exhibits surprising generalization ability and high robustness across different architectures, including recent commercial tools such as DALL-E 3, Midjourney v5, and Firefly. We match the state of the art (SoTA) on in-distribution data and significantly improve upon it in terms of generalization to out-of-distribution data (+6% AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/
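To make the idea of a lightweight detector on frozen features concrete, here is a minimal sketch, not the authors' implementation. It assumes the CLIP image embeddings have already been extracted; the feature vectors below are synthetic stand-ins drawn from two Gaussians, and every name (`make_features`, `train_linear_probe`, `linear_score`, `auc`) is hypothetical. Logistic regression is used here only as one simple choice of linear classifier; the abstract does not say which classifier the paper uses.

```python
import math
import random


def make_features(rng, center, count, dim=8):
    """Draw synthetic stand-ins for image embeddings (NOT real CLIP features)."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(count)]


def linear_score(w, b, x):
    """Detector score: larger means 'more likely AI-generated'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent; returns (w, b)."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            # Clamp the logit so math.exp cannot overflow.
            z = max(-30.0, min(30.0, linear_score(w, b, x)))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


rng = random.Random(0)

# "Handful of examples": 10 real (label 0) and 10 generated (label 1) vectors.
train_real = make_features(rng, -1.0, 10)
train_fake = make_features(rng, +1.0, 10)
w, b = train_linear_probe(train_real + train_fake, [0] * 10 + [1] * 10)

# Held-out synthetic vectors from the same two distributions.
test_real = make_features(rng, -1.0, 50)
test_fake = make_features(rng, +1.0, 50)
test_auc = auc(
    [linear_score(w, b, x) for x in test_fake],
    [linear_score(w, b, x) for x in test_real],
)
print(f"AUC on held-out synthetic features: {test_auc:.3f}")
```

In a real pipeline, `make_features` would be replaced by embeddings from a frozen CLIP image encoder, and the held-out set would come from generators not seen during training, which is the out-of-distribution setting where the AUC comparison in the abstract applies.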
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| AI-generated image detection | GenImage | -- | -- | 106 |
| Video Forgery Detection | DVF (test) | AUC (Video Crafter1) | 63.8 | 19 |
| AI-generated image detection | DIRE | Accuracy | 86.8 | 15 |
| AI-generated image detection | MNW | Accuracy | 60.7 | 15 |
| AI-generated image detection | UDF | Accuracy | 89.5 | 15 |
| AI-generated image detection | Average (AVG) | Accuracy | 79.8 | 15 |
| AI-generated image detection | GANDF | Accuracy | 79.6 | 15 |
| AI-generated image detection | CNNDF | Accuracy | 80.3 | 15 |
| AI-generated image detection | MGD | Accuracy | 55.7 | 15 |
| Binary Video Detection | DVF cross-domain 42 | Accuracy | 67 | 12 |