Rethinking the Use of Vision Transformers for AI-Generated Image Detection

About

Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.

NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim • 2025
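The abstract describes MoLD only at a high level: features from multiple ViT layers are combined through a gating-based mechanism. The sketch below is one minimal way to picture that idea, not the authors' architecture. It assumes one feature vector per layer (for example a CLS token), a softmax gate across layers, and a logistic head. The function names, the per-layer gate vectors, and the random stand-in features are all hypothetical and chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gated_layer_fusion(layer_feats, gate_weights):
    """Fuse per-layer feature vectors with input-dependent gates.

    layer_feats:  list of L feature vectors (one per ViT layer), each of dim D.
    gate_weights: list of L gate vectors of dim D (hypothetical; these would
                  be learned in practice).
    Returns (fused_vector, gates), where the gates sum to 1 across layers.
    """
    # One scalar score per layer, computed from that layer's own feature.
    scores = [dot(w, f) for w, f in zip(gate_weights, layer_feats)]
    gates = softmax(scores)
    dim = len(layer_feats[0])
    # Gate-weighted sum of the layer features, one dimension at a time.
    fused = [sum(g * f[i] for g, f in zip(gates, layer_feats)) for i in range(dim)]
    return fused, gates


def predict_fake_probability(fused, clf_weights, clf_bias=0.0):
    """Logistic head on the fused feature (assumed binary real/generated task)."""
    return 1.0 / (1.0 + math.exp(-(dot(clf_weights, fused) + clf_bias)))


if __name__ == "__main__":
    rng = random.Random(0)
    num_layers, dim = 4, 8
    # Random stand-ins for per-layer ViT features and for learned parameters.
    feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_layers)]
    gate_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_layers)]
    clf_w = [rng.gauss(0, 1) for _ in range(dim)]

    fused, gates = gated_layer_fusion(feats, gate_w)
    print("gates per layer:", [round(g, 3) for g in gates])
    print("P(generated):", round(predict_fake_probability(fused, clf_w), 3))
```

Because the gates are computed from the input features, the weight given to each layer can differ from image to image, which is one reading of "dynamically integrates" in the abstract. With a real backbone, the per-layer features would come from the intermediate hidden states of CLIP-ViT or DINOv2 rather than from random numbers.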

Related benchmarks

Task	Dataset	Result
Generated Image Detection	GenImage (test)	Average Accuracy 88.2	135
Synthetic Image Detection	ForenSynths (test)	Mean Accuracy 91.4	60
AI-generated image detection	GenImage 1.0 (test)	Midjourney Detection Rate 76.1	24
AI-generated image detection	HIFI-Gen	SDv2.1 ACC 82.8	8
