Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

About

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync or unnatural facial and lip movements. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.

Komal Chugh, Parul Gupta, Abhinav Dhall, Ramanathan Subramanian • 2020
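
The abstract describes MDS as an aggregate of chunk-wise audio-visual dissimilarity scores, trained with a contrastive loss. The following is a minimal sketch of that idea, not the authors' implementation. It assumes Euclidean distance as the dissimilarity measure, the mean as the aggregate, a standard margin-based contrastive loss, and a simple per-chunk threshold for localization; none of these choices is stated in the abstract. The function names, the margin, the threshold, and the toy feature vectors are all hypothetical.

```python
import math


def chunk_distance(audio_feat, visual_feat):
    """Euclidean distance between one audio and one visual feature vector
    (assumed dissimilarity measure)."""
    return math.sqrt(sum((a - v) ** 2 for a, v in zip(audio_feat, visual_feat)))


def modality_dissonance_score(audio_feats, visual_feats):
    """Return (video-level score, per-chunk distances).

    The video-level score is the mean of the per-chunk distances
    (assumed aggregate).
    """
    dists = [chunk_distance(a, v) for a, v in zip(audio_feats, visual_feats)]
    return sum(dists) / len(dists), dists


def contrastive_loss(dists, is_real, margin=1.0):
    """Margin-based contrastive loss over the chunks of one video.

    Real videos: penalize any audio-visual distance.
    Fake videos: penalize distances smaller than `margin`.
    """
    if is_real:
        terms = [d ** 2 for d in dists]
    else:
        terms = [max(margin - d, 0.0) ** 2 for d in dists]
    return sum(terms) / len(terms)


def flag_chunks(dists, threshold):
    """Indices of chunks whose distance exceeds `threshold`
    (a simple stand-in for temporal localization)."""
    return [i for i, d in enumerate(dists) if d > threshold]


# Toy features: three chunks, two dimensions each (made-up values).
audio = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
visual = [[0.0, 0.0], [1.0, 0.0], [3.0, 5.0]]

mds, dists = modality_dissonance_score(audio, visual)
suspicious = flag_chunks(dists, threshold=1.0)
print(mds, suspicious)
```

In this toy input only the third chunk has mismatched audio and visual features, so it is the only one flagged. In practice the features would come from the learned audio and visual networks, and the thresholds would be tuned on held-out data.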

Related benchmarks

Task	Dataset	Result
Deepfake Detection	DFDC (test)	AUC 73.8	130
Deepfake Detection	FakeAVCeleb (test)	Accuracy 82.8	66
Audio-visual video forgery detection	FakeAVCeleb	Accuracy 69.29	41
Deepfake Detection	DeepfakeTIMIT LQ	AUC 97.92	19
Deepfake Detection	DeepfakeTIMIT HQ	AUC 0.9687	19
Audio-Visual Deepfake Detection	FakeAVCeleb	Accuracy 93.71	17
Audio-Visual Deepfake Detection	FakeAVCeleb (test)	ACC (Audio-Visual) 82.8	16
Audio-Visual Deepfake Detection	DeepFake Detection Challenge (DFDC)	Accuracy 89.8	11
Deepfake Detection	AV-Deepfake1M official (test)	AUC 0.5657	11
Temporal Deepfake Localization	LAV-DF	AP@0.5 12.78	10

