MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection

About

The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Existing approaches mainly capture face forgery patterns from the image modality, while other modalities such as fine-grained noise and text remain underexplored, which limits the generalization capability of the models. In addition, most FFD methods can identify facial images generated by GANs but struggle to detect unseen diffusion-synthesized ones. To address these limitations, we leverage the foundation model contrastive language-image pre-training (CLIP) to achieve generalizable diffusion face forgery detection (DFFD). In this paper, we propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities via language-guided face forgery representation learning. Specifically, we devise a fine-grained language encoder (FLE) that extracts fine global language features from hierarchical text prompts. We design a multi-modal vision encoder (MVE) to capture global image forgery embeddings as well as fine-grained noise forgery patterns extracted from the richest patch, and integrate them to mine general visual forgery traces. Moreover, we build a plug-and-play sample pair attention (SPA) method that emphasizes relevant negative pairs and suppresses irrelevant ones, allowing cross-modality sample pairs to align more flexibly. Extensive experiments and visualizations show that our model outperforms the state of the art in different settings, including cross-generator, cross-forgery, and cross-dataset evaluations.
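The abstract does not say how SPA computes its weights, so the following is only an illustrative sketch of the general idea of reweighting negative pairs in a CLIP-style contrastive loss. It is not the paper's formulation. The function name `pair_weighted_info_nce`, the softmax-based weighting rule, and the temperature value are all assumptions made for this example, and only the image-to-text direction is shown.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def pair_weighted_info_nce(image_embs, text_embs, temperature=0.07, reweight=True):
    """Image-to-text InfoNCE loss with optional reweighting of negatives.

    Assumption (hypothetical rule, not from the paper): each negative pair is
    weighted by a softmax over the negatives' similarities, scaled so the
    weights average to 1. More similar negatives therefore count more.
    With reweight=False this is the standard InfoNCE loss.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        # Cosine similarity of image i with every text, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)
        ]
        negatives = [j for j in range(n) if j != i]
        if reweight and negatives:
            m = max(logits[j] for j in negatives)
            exps = {j: math.exp(logits[j] - m) for j in negatives}
            z = sum(exps.values())
            weights = {j: len(negatives) * exps[j] / z for j in negatives}
        else:
            weights = {j: 1.0 for j in negatives}
        # Numerically stable log of the weighted softmax denominator.
        m_all = max(logits)
        denom = math.exp(logits[i] - m_all) + sum(
            weights[j] * math.exp(logits[j] - m_all) for j in negatives
        )
        total += -(logits[i] - m_all - math.log(denom))
    return total / n
```

In this sketch, setting `reweight=False` recovers the unweighted loss, which makes it easy to compare the two. A learned attention module could replace the fixed softmax rule, but that design is beyond what the abstract specifies.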

Yaning Zhang, Tianyi Wang, Zitong Yu, Zan Gao, Linlin Shen, Shengyong Chen • 2024

Related benchmarks

Task	Dataset	Result
Deepfake Attribution	DF40 and FFHQ unseen generators	SimSwap Accuracy 71.52	54
Face Forgery Detection	GenFace (EFS)	Accuracy 76.88	52
Face Forgery Detection	GenFace DDPM (test)	Accuracy 100	51
Attribution	WildDeepfake	Accuracy 73.44	34
Deepfake Detection	FaceSwapper (test)	Accuracy 76.52	30
Facial Forgery Detection	AM IAFaces (test)	Accuracy (ACC) 55.26	30
Face Forgery Detection	GenFace AM	Accuracy 62.58	26
Deepfake Attribution	LivePortrait unseen	Accuracy 74.88	20
Deepfake Attribution	LIA unseen	Accuracy 81.52	20
Deepfake Attribution	FSRT unseen	Accuracy (%) 81.92	20

Showing 10 of 36 rows
