Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline
About
Existing facial forgery detection methods typically focus on binary classification or pixel-level localization, providing little semantic insight into the nature of the manipulation. To address this, we introduce Forgery Attribution Report Generation, a new multimodal task that jointly localizes forged regions ("Where") and generates natural language explanations grounded in the editing process ("Why"). This dual-focus approach goes beyond traditional forensics, providing a comprehensive understanding of the manipulation. To enable research in this domain, we present Multi-Modal Tamper Tracing (MMTT), a large-scale dataset of 152,217 samples, each with a process-derived ground-truth mask and a human-authored textual description, ensuring high annotation precision and linguistic richness. We further propose ForgeryTalker, a unified end-to-end framework that integrates vision and language via a shared encoder (image encoder + Q-Former) and dual decoders for mask and text generation, enabling coherent cross-modal reasoning. Experiments show that ForgeryTalker achieves competitive performance on both subtasks, reaching 59.3 CIDEr on report generation and 73.67 IoU on forgery localization, establishing a baseline for explainable multimedia forensics. The dataset and code will be released to foster future research.
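The localization subtask above is scored with intersection-over-union (IoU) between the predicted and ground-truth masks. A minimal sketch of a per-image mask IoU, assuming a 0.5 binarization threshold and an empty-mask convention that are our own illustrative choices, not details from the MMTT evaluation protocol:

```python
def mask_iou(pred, gt, threshold=0.5):
    """IoU between a predicted soft mask and a binary ground-truth mask,
    both given as 2-D lists of floats in [0, 1].

    The 0.5 threshold and the both-empty = 1.0 convention are assumptions
    for illustration; the official protocol may differ.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            p_bin = p >= threshold  # binarize predicted pixel
            g_bin = g >= threshold  # ground truth is already 0/1
            inter += p_bin and g_bin
            union += p_bin or g_bin
    return inter / union if union else 1.0  # both masks empty: perfect match


# Toy 2x2 masks: prediction covers 2 of the 3 forged pixels.
pred = [[0.9, 0.2], [0.8, 0.1]]
gt   = [[1.0, 0.0], [1.0, 1.0]]
print(mask_iou(pred, gt))  # → 0.6666666666666666
```

Dataset-level scores such as the 73.67 IoU reported above are typically an average of this per-image quantity over the test set.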
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Report Generation | MMTT | CIDEr | 59.3 | 11 |
| Interpretation Generation | MMTT (test) | CIDEr | 59.3 | 10 |
| Forgery Localization | MMTT | IoU | 73.67 | 6 |
| Report Generation | DQ_F++ zero-shot 2024b | BLEU-1 | 48.5 | 4 |
| Report Generation | SynthScars face-modification | BLEU-1 | 10.8 | 3 |
| Forgery Localization | MMTT (test) | IoU | 73.67 | 3 |