Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline
About
Existing facial forgery detection methods typically focus on binary classification or pixel-level localization, providing little semantic insight into the nature of the manipulation. To address this, we introduce Forgery Attribution Report Generation, a new multimodal task that jointly localizes forged regions ("Where") and generates natural language explanations grounded in the editing process ("Why"). This dual-focus approach goes beyond traditional forensics, providing a comprehensive understanding of the manipulation. To enable research in this domain, we present Multi-Modal Tamper Tracing (MMTT), a large-scale dataset of 152,217 samples, each with a process-derived ground-truth mask and a human-authored textual description, ensuring high annotation precision and linguistic richness. We further propose ForgeryTalker, a unified end-to-end framework that integrates vision and language via a shared encoder (image encoder + Q-Former) and dual decoders for mask and text generation, enabling coherent cross-modal reasoning. Experiments show that ForgeryTalker achieves competitive performance on both subtasks, reaching 59.3 CIDEr on report generation and 73.67 IoU on forgery localization, establishing a baseline for explainable multimedia forensics. The dataset and code will be released to foster future research.
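The localization subtask above is scored with intersection-over-union (IoU) between the predicted and ground-truth masks. A minimal sketch of a per-image mask IoU, assuming a 0.5 binarization threshold and an empty-mask convention that are our own illustrative choices, not details from the MMTT evaluation protocol:

```python
def mask_iou(pred, gt, threshold=0.5):
    """IoU between a predicted soft mask and a binary ground-truth mask,
    both given as 2-D lists of floats in [0, 1].

    The 0.5 threshold and the both-empty = 1.0 convention are assumptions
    for illustration; the official protocol may differ.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            p_bin = p >= threshold  # binarize predicted pixel
            g_bin = g >= threshold  # ground truth is already 0/1
            inter += p_bin and g_bin
            union += p_bin or g_bin
    return inter / union if union else 1.0  # both masks empty: perfect match


# Toy 2x2 masks: prediction covers 2 of the 3 forged pixels.
pred = [[0.9, 0.2], [0.8, 0.1]]
gt   = [[1.0, 0.0], [1.0, 1.0]]
print(mask_iou(pred, gt))  # → 0.6666666666666666
```

Dataset-level scores such as the 73.67 IoU reported above are typically an average of this per-image quantity over the test set.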
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Report Generation | MMTT | CIDEr | 59.3 | 11 |
| Interpretation Generation | MMTT (test) | CIDEr | 59.3 | 10 |
| Forgery Localization | MMTT | IoU | 73.67 | 6 |
| Report Generation | DQ_F++ zero-shot 2024b | BLEU-1 | 48.5 | 4 |
| Report Generation | SynthScars face-modification | BLEU-1 | 10.8 | 3 |
| Forgery Localization | MMTT (test) | IoU | 73.67 | 3 |