ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization
About
Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.
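To make the idea of "materializing" imperceptible traces more concrete, the following is a minimal sketch of two toolbox-style operations named in the abstract: a local zoom-in and a noise-residual view. It is not the authors' implementation. The function names `zoom_in` and `noise_residual`, the `(top, left, height, width)` box format, and the 3x3 mean-filter high-pass are all assumptions chosen for illustration, and images are represented as plain nested lists of grayscale values.

```python
def zoom_in(image, box):
    """Crop a region for closer inspection.

    `box` is a hypothetical (top, left, height, width) tuple; the paper's
    actual tool interface is not specified in the abstract.
    """
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def noise_residual(gray):
    """Return a high-pass residual: each pixel minus its 3x3 local mean.

    This is a generic stand-in for a noise-residual tool (an assumption,
    not the filter used by ForgeryVCR). Borders are handled by clamping
    neighbor coordinates to the image edge.
    """
    h, w = len(gray), len(gray[0])
    residual = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += gray[yy][xx]
            row.append(gray[y][x] - total / 9.0)
        residual.append(row)
    return residual
```

In a smooth, untouched region the residual stays close to zero, while a spliced patch with different noise statistics tends to stand out, which is the kind of visual intermediate a model can inspect instead of describing it in text.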
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Forgery Detection | CocoGlide | -- | -- | 15 |
| Image-level Forgery Detection | CASIA v1 | F1 Score | 91.93 | 11 |
| Image-level Forgery Detection | Coverage | F1 Score | 79.63 | 11 |
| Image-level Forgery Detection | NIST16 | F1 Score | 72.71 | 11 |
| Image-level Forgery Detection | Weighted Avg | F1 Score | 82.71 | 11 |
| Image-level Forgery Detection | DSO | F1 | 84.97 | 11 |
| Pixel-level Forgery Localization | CASIA v1 | F1 Score | 70.92 | 11 |
| Pixel-level Forgery Localization | Coverage | F1 Score | 67.32 | 11 |
| Pixel-level Forgery Localization | NIST16 | F1 Score | 0.5001 | 11 |
| Pixel-level Forgery Localization | in the wild | F1 Score | 69.18 | 11 |
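For readers unfamiliar with the localization metric, pixel-level F1 compares a predicted binary tampering mask against the ground-truth mask. The sketch below uses the standard F1 definition, 2TP / (2TP + FP + FN). The function name `pixel_f1` and the choice to return 1.0 when both masks are empty are assumptions for illustration, not the benchmark's evaluation protocol. The function returns a value in [0, 1]; most results in the table appear to be reported on a 0 to 100 scale.

```python
def pixel_f1(pred_mask, gt_mask):
    """Pixel-level F1 between two binary masks given as nested lists.

    Truthy values mark tampered pixels. Returning 1.0 when neither mask
    contains a tampered pixel is an assumed convention.
    """
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1          # correctly flagged tampered pixel
            elif p and not g:
                fp += 1          # flagged but actually authentic
            elif g and not p:
                fn += 1          # tampered pixel that was missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
print(pixel_f1(pred, gt))  # 0.5
```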