Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning
About
Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we improve efficiency without compromising accuracy? This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective for video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we propose a novel Video-level Blending data (VB), implemented by blending the original image and its warped version frame by frame, which serves as a hard negative sample for mining more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) to equip a pretrained image model (both ViTs and CNNs) with the ability to capture spatial and temporal features jointly and efficiently. StA uses a two-stream 3D-Conv design with varying kernel sizes, allowing it to process spatial and temporal features separately. Extensive experiments validate the effectiveness of the proposed methods and show that our approach generalizes well to previously unseen forgery videos, even those produced by the latest generation methods.
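The frame-by-frame blending described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the warp, the blending mask, or how the warp varies over time, so the whole-pixel translation in `shift_warp`, the rectangular `mask`, and the `max_shift` and `seed` parameters are all hypothetical stand-ins chosen only to show the structure of the operation.

```python
import numpy as np


def shift_warp(frame, dx, dy):
    """Stand-in warp (assumption): translate by whole pixels, wrapping at borders."""
    return np.roll(frame, shift=(dy, dx), axis=(0, 1))


def video_level_blend(frames, mask, max_shift=2, seed=0):
    """Blend every frame with a warped copy of itself inside `mask`.

    frames: float array of shape (T, H, W, C) with values in [0, 1]
    mask:   float array of shape (H, W) in [0, 1]; 1 means "use the warped copy"
    """
    rng = np.random.default_rng(seed)
    out = np.empty_like(frames)
    m = mask[..., None]  # add a channel axis so the mask broadcasts over C
    for t, frame in enumerate(frames):
        # A fresh small offset per frame makes the blended region drift slightly
        # between consecutive frames (hypothetical schedule, not from the paper).
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        warped = shift_warp(frame, int(dx), int(dy))
        out[t] = m * warped + (1.0 - m) * frame
    return out


# Usage: a random 4-frame clip with a square blending region in the centre.
clip = np.random.default_rng(1).random((4, 16, 16, 3))
region = np.zeros((16, 16))
region[4:12, 4:12] = 1.0
blended = video_level_blend(clip, region)
```

Because the output is a convex combination of a real frame and a warped copy of the same frame, it contains no generator-specific texture; only the inconsistent placement of the masked region across frames distinguishes it from the original clip, which is the sense in which such data acts as a hard negative sample.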
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Deepfake Detection | DFDC | AUC | 84.3 | 135 |
| Deepfake Detection | DFDC (test) | AUC | 84.3 | 87 |
| Deepfake Detection | DFD | AUC | 0.965 | 77 |
| Deepfake Detection | CDFv1, CDFv2, DFD, DFDCP, DFDC (test) | DFD Score | 96.5 | 42 |
| Deepfake Detection | DFD | Video AUC | 0.965 | 23 |
| Video Deepfake Detection | Celeb-DF (CDF) | Video-level AUC | 94.7 | 21 |
| Image Deepfake Detection | DFo | AUC | 0.991 | 20 |
| Deepfake Detection | DFDCP | Video-level AUC | 0.909 | 20 |
| Deepfake Detection | WildDeepfake (WDF) | Video-level AUC | 0.848 | 17 |
| Deepfake Detection | Celeb-DF v2 (test) | Video-level AUC | 0.947 | 16 |