Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images

About

The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising practical concerns for digital integrity. Vision-language models (VLMs) can provide natural language explanations, but standard one-pass classifiers often miss subtle artifacts in high-quality synthetic images and offer limited grounding in the pixels. We propose Locate-Then-Examine (LTE), a two-stage VLM-based forensic framework that first localizes suspicious regions and then re-examines these crops together with the full image to refine the real vs. AI-generated verdict and its explanation. LTE explicitly links each decision to localized visual evidence through region proposals and region-aware reasoning. To support training and evaluation, we introduce TRACE, a dataset of 20,000 real and high-quality synthetic images with region-level annotations and automatically generated forensic explanations, constructed by a VLM-based pipeline with additional consistency checks and quality control. Across TRACE and multiple external benchmarks, LTE achieves competitive accuracy and improved robustness while providing human-understandable, region-grounded explanations suitable for forensic deployment.
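The two-stage flow described above (localize suspicious regions, then re-examine the crops together with the full image) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Box`, `Verdict`, `crop`, `locate_then_examine`, `toy_locate`, and `toy_examine` are hypothetical, the image is a toy nested list of grayscale values, and the two toy functions are simple stand-ins for the VLM calls that the paper uses for each stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values (assumption)


@dataclass(frozen=True)
class Box:
    # Half-open pixel bounds: columns x0..x1-1, rows y0..y1-1.
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass
class Verdict:
    label: str          # "real" or "ai-generated"
    explanation: str    # free-text rationale tied to the regions
    regions: List[Box]  # the regions the decision is grounded in


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by `box` out of `image`."""
    return [row[box.x0:box.x1] for row in image[box.y0:box.y1]]


def locate_then_examine(
    image: Image,
    locate: Callable[[Image], Sequence[Box]],
    examine: Callable[[Image, List[Image]], Tuple[str, str]],
) -> Verdict:
    # Stage 1: propose suspicious regions from the full image.
    regions = list(locate(image))
    # Stage 2: re-examine the crops together with the full image.
    crops = [crop(image, box) for box in regions]
    label, explanation = examine(image, crops)
    return Verdict(label, explanation, regions)


def toy_locate(image: Image) -> List[Box]:
    # Hypothetical stand-in for the localization stage:
    # flag a 2x2 window at the brightest pixel (needs an image of at least 2x2).
    h, w = len(image), len(image[0])
    y, x = max(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: image[rc[0]][rc[1]],
    )
    x0, y0 = min(x, w - 2), min(y, h - 2)
    return [Box(x0, y0, x0 + 2, y0 + 2)]


def toy_examine(image: Image, crops: List[Image]) -> Tuple[str, str]:
    # Hypothetical stand-in for the examination stage:
    # call the image synthetic if any crop contains a very bright pixel.
    peak = max((v for c in crops for row in c for v in row), default=0)
    label = "ai-generated" if peak > 200 else "real"
    return label, f"peak intensity {peak} in {len(crops)} region(s)"


image = [
    [0, 0, 0, 0],
    [0, 0, 255, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
verdict = locate_then_examine(image, toy_locate, toy_examine)
print(verdict.label, verdict.regions, verdict.explanation)
```

Passing the two stages in as callables keeps the control flow separate from whichever model performs localization and examination, and returning the regions alongside the label mirrors the idea that each decision is linked to localized evidence.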

Yikun Ji, Yan Hong, Bowen Deng, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang • 2025

Related benchmarks

Task	Dataset	Metric	Result
Image Forgery Detection	TRACE (test)	Accuracy	97.2	18
Visual Reasoning	TRACE (test)	BLEU-1	0.346	17
Generative image detection	FakeClue (test)	Overall Accuracy	90.3	11
Forgery Grounding	TRACE (test)	IoU	35.9	10
Image Forgery Detection and Reasoning	MMFR (test)	Accuracy	89.3	4
Image Forgery Detection and Reasoning	SynthScars (test)	Accuracy	85.2	4
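
For readers unfamiliar with the IoU metric reported for forgery grounding, the sketch below shows how intersection over union is commonly computed for two axis-aligned boxes. It is an illustration only: the listing does not say whether the reported IoU is computed over boxes or masks, so the box form and the hypothetical `box_iou` helper are assumptions, and the function returns a fraction in [0, 1] rather than a percentage.

```python
from typing import Tuple

BoxXYXY = Tuple[float, float, float, float]  # (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1


def box_iou(a: BoxXYXY, b: BoxXYXY) -> float:
    """Intersection over union of two axis-aligned boxes (assumed box-level IoU)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
annotated = (1.0, 1.0, 3.0, 3.0)
print(box_iou(predicted, annotated))
```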
