
Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images

About

The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising practical concerns for digital integrity. Vision-language models (VLMs) can provide natural language explanations, but standard one-pass classifiers often miss subtle artifacts in high-quality synthetic images and offer limited grounding in the pixels. We propose Locate-Then-Examine (LTE), a two-stage VLM-based forensic framework that first localizes suspicious regions and then re-examines these crops together with the full image to refine the real vs. AI-generated verdict and its explanation. LTE explicitly links each decision to localized visual evidence through region proposals and region-aware reasoning. To support training and evaluation, we introduce TRACE, a dataset of 20,000 real and high-quality synthetic images with region-level annotations and automatically generated forensic explanations, constructed by a VLM-based pipeline with additional consistency checks and quality control. Across TRACE and multiple external benchmarks, LTE achieves competitive accuracy and improved robustness while providing human-understandable, region-grounded explanations suitable for forensic deployment.
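The two-stage flow described above can be sketched as a small control loop. This is a minimal illustration only, not the authors' implementation: the names `locate_then_examine`, `stub_locate`, `stub_examine`, and `Verdict` are hypothetical, the image is a toy nested list rather than a real image type, and the two stub functions stand in for the VLM calls that a real system would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Image = List[List[int]]          # toy grayscale image: rows of pixel values


@dataclass
class Verdict:
    label: str          # "real" or "ai-generated"
    explanation: str    # natural-language justification
    regions: List[Box]  # localized evidence the verdict is tied to


def crop(image: Image, box: Box) -> Image:
    """Cut the region given by box out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def locate_then_examine(
    image: Image,
    locate: Callable[[Image], List[Box]],
    examine: Callable[[Image, List[Image]], Tuple[str, str]],
) -> Verdict:
    # Stage 1: propose suspicious regions on the full image.
    regions = locate(image)
    # Stage 2: re-examine the crops together with the full image.
    crops = [crop(image, box) for box in regions]
    label, explanation = examine(image, crops)
    return Verdict(label=label, explanation=explanation, regions=regions)


def stub_locate(image: Image) -> List[Box]:
    # Placeholder (assumption): always flags the top-left 2x2 patch.
    return [(0, 0, 2, 2)]


def stub_examine(image: Image, crops: List[Image]) -> Tuple[str, str]:
    # Placeholder (assumption): any flagged crop yields "ai-generated".
    if crops:
        return "ai-generated", f"{len(crops)} suspicious region(s) re-examined"
    return "real", "no suspicious regions found"


if __name__ == "__main__":
    toy_image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    print(locate_then_examine(toy_image, stub_locate, stub_examine))
```

Returning the proposed regions alongside the label and explanation is one way to keep each decision tied to localized evidence, which is the property the abstract emphasizes.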

Yikun Ji, Yan Hong, Bowen Deng, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang • 2025

Related benchmarks

Task                                   Dataset            Metric            Result  Rank
Image Forgery Detection                TRACE (test)       Accuracy          97.2    18
Visual Reasoning                       TRACE (test)       BLEU-1            0.346   17
Generative image detection             FakeClue (test)    Overall Accuracy  90.3    11
Forgery Grounding                      TRACE (test)       IoU               35.9    10
Image Forgery Detection and Reasoning  MMFR (test)        Accuracy          89.3    4
Image Forgery Detection and Reasoning  SynthScars (test)  Accuracy          85.2    4
