
SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

About

Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized by generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly exploits the Style-LInguistics Mismatch (SLIM) in fake speech to separate it from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used together with standard pretrained acoustic features (e.g., Wav2vec) to train a classifier on the real and fake classes. With the feature encoders frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, thereby facilitating an explanation of the model decision.
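At a high level, SLIM scores how well a sample's style representation agrees with its linguistic representation, since synthesized speech tends to break the dependency that holds for real speech. A minimal sketch of that idea, using a cosine-distance proxy for the learned mismatch (the random embeddings and the `mismatch_score` helper are illustrative assumptions, not the paper's actual features or scoring function):

```python
import numpy as np

def mismatch_score(style_emb, ling_emb):
    """Toy proxy for the style-linguistics (mis)match.

    SLIM learns the dependency between style and linguistic
    features from real speech only; here we approximate a
    mismatch score as 1 - cosine similarity between the two
    embedding vectors (0 = perfect match, 2 = opposite).
    """
    s = style_emb / np.linalg.norm(style_emb)
    l = ling_emb / np.linalg.norm(ling_emb)
    return 1.0 - float(np.dot(s, l))

# Hypothetical 64-dim embeddings for one audio sample.
rng = np.random.default_rng(0)
style = rng.normal(size=64)
linguistic = rng.normal(size=64)
print(mismatch_score(style, linguistic))
```

A high score would suggest the style and linguistic content disagree, which is the signal the classifier combines with standard acoustic features.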

Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio Deepfake Detection | In the Wild | EER 12.5 | 58 |
| Audio Deepfake Detection | ASVspoof 2021 | EER 4.4 | 27 |
| Audio Deepfake Detection | ASVspoof 2019 | EER 0.2 | 25 |
| Audio Deepfake Detection | MLAAD-EN | EER 10.7 | 18 |
| Audio Deepfake Detection | ASVspoof LA and DF 2021 | EER (DF) 4.4 | 17 |
| Deepfake Audio Detection | ASVspoof LA 2019 | EER (%) 20 | 12 |
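The results above are reported as equal error rate (EER), the operating point where the false-acceptance rate equals the false-rejection rate (lower is better). A minimal sketch of how EER can be computed from raw detector scores via a threshold sweep (the `equal_error_rate` helper is illustrative, not the benchmarks' official scoring code):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Discrete EER estimate from detector scores.

    scores: higher = more likely fake (positive class).
    labels: 1 for fake, 0 for real; both classes must be present.
    """
    order = np.argsort(scores)[::-1]          # sort scores descending
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tps = np.cumsum(labels)                   # true positives per cut-off
    fps = np.cumsum(1 - labels)               # false positives per cut-off
    fnr = 1 - tps / n_pos                     # miss (false-rejection) rate
    fpr = fps / n_neg                         # false-acceptance rate
    idx = np.argmin(np.abs(fnr - fpr))        # point where the rates cross
    return float((fnr[idx] + fpr[idx]) / 2)

scores = [0.9, 0.8, 0.2, 0.1]  # toy detector scores (illustrative)
labels = [1, 1, 0, 0]          # 1 = fake, 0 = real
print(equal_error_rate(scores, labels))  # perfectly separated -> 0.0
```

An EER of 12.5 on In the Wild versus 0.2 on ASVspoof 2019 illustrates the in-domain/out-of-domain gap the abstract refers to.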
