M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

About

The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images. In this paper, we aim to capture the subtle manipulation artifacts at different scales using transformer models. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which operates on patches of different sizes to detect local inconsistencies in images at different spatial levels. M2TR further learns to detect forgery artifacts in the frequency domain to complement RGB information through a carefully designed cross-modality fusion block. In addition, to stimulate Deepfake detection research, we introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 Deepfake videos generated by state-of-the-art face-swapping and facial reenactment methods. We conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods by clear margins.
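The two ideas in the abstract, patches of several sizes and a frequency-domain view of the same input, can be illustrated with a toy sketch. This is not the M2TR implementation: the function names `split_into_patches` and `dft2_magnitude`, the patch sizes (2, 4, 8), the 8x8 toy image, and the use of a naive 2D discrete Fourier transform as the frequency representation are all assumptions made for illustration. The transformer layers and the cross-modality fusion block are not reproduced here.

```python
import cmath
import math


def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


def dft2_magnitude(image):
    """Magnitude of a naive 2D DFT (an assumed stand-in for a frequency branch)."""
    height, width = len(image), len(image[0])
    spectrum = []
    for u in range(height):
        row = []
        for v in range(width):
            total = 0j
            for y in range(height):
                for x in range(width):
                    angle = -2 * math.pi * (u * y / height + v * x / width)
                    total += image[y][x] * cmath.exp(1j * angle)
            row.append(abs(total))
        spectrum.append(row)
    return spectrum


# Toy 8x8 single-channel grid; a real detector would work on face crops
# or on feature maps derived from them.
image = [[(x + y) % 4 for x in range(8)] for y in range(8)]

# Hypothetical patch sizes: small patches expose fine local detail,
# large patches cover wider regions of the same input.
multi_scale = {size: split_into_patches(image, size) for size in (2, 4, 8)}
for size, patches in multi_scale.items():
    print(f"patch size {size}: {len(patches)} patches")

# Frequency view of the same input; entry [0][0] is the sum of all pixels.
spectrum = dft2_magnitude(image)
print(f"DC term: {spectrum[0][0]:.1f}")
```

In this sketch each patch size yields a different number of tokens for the same input (16, 4, and 1 patches for the 8x8 grid), which is the sense in which a multi-scale model looks at several spatial levels at once. The spectrum is computed independently of the patches and would serve as the second modality to be fused with the spatial one.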

Junke Wang, Zuxuan Wu, Wenhao Ouyang, Xintong Han, Jingjing Chen, Ser-Nam Lim, Yu-Gang Jiang • 2021

Related benchmarks

Task	Dataset	Metric	Result
Face Forgery Detection	GenFace (EFS)	Accuracy	70.98	52
Face Forgery Detection	GenFace DDPM (test)	Accuracy	99.84	51
Deepfake Detection	FF++ (test)	AUC	99.5	44
Deepfake Detection	CelebDF (test)	AUC	0.657	30
Deepfake Detection	FaceSwapper (test)	Accuracy	51.49	30
Facial Forgery Detection	AM IAFaces (test)	Accuracy (ACC)	50	30
Face Forgery Detection	GenFace AM	Accuracy	55.12	26
Deepfake Detection	FaceForensics++ LQ	AUC	0.9531	17
Facial Forgery Detection	AM Diffae (test)	Accuracy	50.42	15
Deepfake Detection	DiffFace (test)	Accuracy	50.42	15

