Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model
About
Generative models have enabled the creation of highly realistic synthetic facial images, raising significant concerns due to their potential for misuse. Despite rapid advances in deepfake detection, it remains challenging to leverage foundation models efficiently for better generalization to unseen forgery samples. To address this challenge, we propose a novel side-network-based decoder that extracts spatial and temporal cues using the CLIP image encoder for generalized video-based deepfake detection. Additionally, we introduce Facial Component Guidance (FCG) to enhance the generalizability of spatial learning by encouraging the model to focus on key facial regions. By leveraging the generic features of a vision-language foundation model, our approach demonstrates promising generalizability on challenging deepfake datasets while also exhibiting superiority in training data efficiency, parameter efficiency, and model robustness.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Deepfake Detection | CDFv1, CDFv2, DFD, DFDCP, DFDC (test) | Overall Average Score 88.4 | 74 |
| Deepfake Detection | CelebDF v2 | AUC 0.95 | 57 |
| Deepfake Detection | DFDCP (test) | -- | 55 |
| Face Forgery Detection | DFDC | AUC 81.81 | 52 |
| Video-level Deepfake Detection | DFDC | AUC 0.818 | 34 |
| Deepfake Detection | KoDF (test) | AUC 97.4 | 31 |
| Video Deepfake Detection | DF-TIMIT (test) | AUC 99.02 | 27 |
| Deepfake Detection | WildDeepfake (WDF) | Video-level AUC 0.872 | 26 |
| Deepfake Detection | Protocol 2 Hybrid, FR, FS, EFS v1 (test) | Hybrid AUC 99.4 | 24 |
| Deepfake Detection | Protocol 1 (FF++, DFDCP, DFD, CDF2) v1 (test) | Accuracy on FF++ 99.7 | 24 |
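
The table reports AUC on two scales (a fraction such as 0.818 and a percentage such as 81.81), and some entries are labeled video-level. As a rough illustration of what such a number measures, the sketch below pools per-frame fake probabilities into one score per video and then computes AUC as the probability that a randomly chosen fake video scores higher than a randomly chosen real one. Mean pooling, the function names, and the toy data are all assumptions made for this example; the source does not describe how the listed results were aggregated or evaluated.

```python
from collections import defaultdict


def video_level_scores(frame_scores):
    """Average per-frame fake probabilities into one score per video.

    frame_scores: iterable of (video_id, score) pairs. Mean pooling is an
    assumption of this sketch, not a rule stated in the source.
    """
    buckets = defaultdict(list)
    for video_id, score in frame_scores:
        buckets[video_id].append(score)
    return {vid: sum(s) / len(s) for vid, s in buckets.items()}


def roc_auc(labels, scores):
    """AUC via pairwise ranking: fraction of (fake, real) pairs ranked correctly.

    labels: 1 for fake, 0 for real. A tie counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one fake and one real sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: videos "a" and "b" are fake, "c" and "d" are real.
frames = [
    ("a", 0.9), ("a", 0.7),
    ("b", 0.4), ("b", 0.6),
    ("c", 0.2), ("c", 0.4),
    ("d", 0.5), ("d", 0.7),
]
truth = {"a": 1, "b": 1, "c": 0, "d": 0}

per_video = video_level_scores(frames)
ids = sorted(per_video)
auc = roc_auc([truth[v] for v in ids], [per_video[v] for v in ids])

# The same value on both scales that appear in the table.
print(f"AUC = {auc:.3f} (fraction) = {100 * auc:.1f} (percent)")
```

In this toy case three of the four (fake, real) video pairs are ranked correctly, so the AUC is 0.75, or 75.0 on the percentage scale. The pairwise form is quadratic in the number of videos and is meant only to make the definition concrete.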