Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs
About
Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
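The selection stage described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fusion weight `alpha`, the per-cluster budget, and the suppression radius are all hypothetical. It also assumes that cluster labels and a normalized frequency/noise anomaly score per patch have already been computed elsewhere, and that patches are indexed row by row on a grid of width `grid_w`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fused_scores(cls_vec, patch_vecs, anomaly, alpha=0.5):
    """Fuse CLS-guided mismatch (1 - cosine to CLS) with a precomputed
    anomaly score in [0, 1]. `alpha` is an assumed mixing weight."""
    return [alpha * (1.0 - cosine(cls_vec, p)) + (1.0 - alpha) * a
            for p, a in zip(patch_vecs, anomaly)]

def select_evidence(scores, labels, grid_w, per_cluster=2, radius=1):
    """Take the top-scoring patches in each cluster, then apply a simple
    grid NMS: a candidate is dropped if it lies within `radius` cells
    (Chebyshev distance) of an already kept, higher-scoring patch."""
    by_cluster = {}
    for idx, lab in enumerate(labels):
        by_cluster.setdefault(lab, []).append(idx)

    candidates = []
    for members in by_cluster.values():
        members.sort(key=lambda i: scores[i], reverse=True)
        candidates.extend(members[:per_cluster])

    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        r, c = divmod(i, grid_w)
        if all(max(abs(r - kr), abs(c - kc)) > radius
               for kr, kc in (divmod(k, grid_w) for k in kept)):
            kept.append(i)
    return kept

# Toy 2x3 patch grid with two clusters (values are made up for illustration).
toy_scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.7]
toy_labels = [0, 0, 0, 1, 1, 1]
evidence = select_evidence(toy_scores, toy_labels, grid_w=3)
```

In this toy run, patches 1 and 4 are suppressed because they sit next to the higher-scoring patch 0, so the evidence pack spreads across the grid instead of concentrating in one region. How the selected patch tokens are then passed to the frozen LVLM is outside the scope of this sketch.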
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Deepfake Detection | DFBench Overall (Full partition) | Accuracy | 54.44 | 9 |
| Image Deepfake Detection | LIVE | Accuracy | 91.74 | 9 |
| Image Deepfake Detection | CSIQ | Accuracy | 90.03 | 9 |
| Image Deepfake Detection | TID 2013 | Accuracy | 88.32 | 9 |
| Image Deepfake Detection | KADID | Accuracy | 89.92 | 9 |
| Image Deepfake Detection | DFBench AI-Edited (test) | Object Enhance Accuracy | 57.32 | 9 |
| Image Deepfake Detection | KonIQ-10k | Accuracy | 85.32 | 9 |
| Image Deepfake Detection | DFBench Playground AI-generated 1.0 | Accuracy | 43.33 | 8 |
| Image Deepfake Detection | DFBench SD3.5 Large (AI-generated) | Accuracy | 36.41 | 8 |