
Rethinking the Use of Vision Transformers for AI-Generated Image Detection

About

Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
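The gating-based fusion the abstract describes can be illustrated with a minimal sketch: per-layer CLS features from a ViT are combined by input-dependent softmax weights produced by a small gating function. All names and shapes below are illustrative assumptions for a CLIP ViT-B/16-style backbone (12 layers, 768-dim features), not the authors' actual MoLD implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_layer_fusion(layer_feats, gate_w, gate_b):
    """Fuse per-layer features with input-dependent gating weights.

    layer_feats: (L, D) array, one D-dim feature per ViT layer.
    gate_w: (L*D, L) gating weights; gate_b: (L,) bias.
    (Hypothetical parameterization: a single linear gate over the
    concatenated layer features, softmax-normalized across layers.)
    """
    x = layer_feats.reshape(-1)              # concatenate layer features
    gates = softmax(x @ gate_w + gate_b)     # one scalar weight per layer
    fused = (gates[:, None] * layer_feats).sum(axis=0)  # weighted sum
    return fused, gates

rng = np.random.default_rng(0)
L, D = 12, 768                               # e.g., ViT-B/16: 12 layers, 768-dim
feats = rng.standard_normal((L, D))
W = rng.standard_normal((L * D, L)) * 0.01
b = np.zeros(L)
fused, gates = gated_layer_fusion(feats, W, b)
print(fused.shape)                           # fused feature fed to the classifier
```

In a full detector, `fused` would feed a binary real/fake classification head, and the gate lets the model emphasize earlier, more localized layers when they generalize better, consistent with the layer-wise analysis above.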

NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Generated Image Detection | GenImage (test) | Average Accuracy: 88.2 | 103 |
| Synthetic Image Detection | ForenSynths (test) | Mean Accuracy: 91.4 | 31 |
| AI-Generated Image Detection | GenImage 1.0 (test) | Midjourney Detection Rate: 76.1 | 24 |
| AI-Generated Image Detection | HIFI-Gen | SDv2.1 Accuracy: 82.8 | 8 |
