Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

About

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
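The binary-detection result above is reported as an Equal Error Rate, the operating point at which the false-acceptance and false-rejection rates coincide. As a rough illustration only, the sketch below estimates EER from a list of detector scores by sweeping thresholds. The function name `equal_error_rate`, the convention that higher scores mean "more likely fake", and the label encoding (1 = fake, 0 = real) are assumptions made for this example and are not taken from the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for a binary spoof detector.

    Assumed conventions (not from the paper): a higher score means
    'more likely fake'; labels use 1 for fake and 0 for real.
    """
    fake = [s for s, y in zip(scores, labels) if y == 1]
    real = [s for s, y in zip(scores, labels) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one real and one fake example")

    # Try every observed score as a threshold, plus one above the maximum
    # so that the 'reject everything' operating point is also covered.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]

    best_gap, eer = float("inf"), None
    for t in thresholds:
        far = sum(s >= t for s in real) / len(real)  # real flagged as fake
        frr = sum(s < t for s in fake) / len(fake)   # fake passed as real
        gap = abs(far - frr)
        if gap < best_gap:
            # Where the two rates are closest, report their midpoint.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    demo_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
    demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]
    print(f"EER = {equal_error_rate(demo_scores, demo_labels):.2%}")
```

With a finite test set the two error rates rarely cross exactly, so this sketch reports the midpoint at the closest threshold. Evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.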

S. Sutharya, Remya K. Sasi • 2026

Related benchmarks

Task                            Dataset                Result                          Rank
Audio Deepfake Detection        in the wild            EER 45.9                        65
Audio Deepfake Detection        FoR                    EER 10.34                       28
Audio Deepfake Detection        MLADDC T2 (test)       Accuracy 96.76                  6
Temporal boundary localisation  MLADDC T3              MAE (s) 0.068                   3
Audio Deepfake Detection        WaveFake               Accuracy 17.3                   1
Audio Deepfake Detection        ASVspoof 2019          Accuracy 84.68                  1
Temporal Localisation           MLADDC T2+T3 (test)    Temporal MAE (overall) 0.075    1
Three-Class Detection           MLADDC T2+T3 (test)    Overall Accuracy 92.71          1
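
The localisation rows report a mean absolute error in seconds, and the abstract also quotes a median error. The sketch below shows one plausible way to compute both from predicted and reference segment boundaries. It assumes each utterance has a single manipulated region given as a `(start, end)` pair in seconds, and that start and end errors are pooled before averaging. The paper's exact aggregation is not stated here, so treat `boundary_errors` and this pooling choice as illustrative assumptions.

```python
from statistics import mean, median


def boundary_errors(predicted, reference):
    """Return (MAE, median absolute error) in seconds.

    Assumption (not from the paper): each item is a (start_s, end_s)
    tuple for one manipulated region, and start and end errors are
    pooled into a single list before aggregating.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    if not predicted:
        raise ValueError("need at least one segment")

    errors = []
    for (p_start, p_end), (r_start, r_end) in zip(predicted, reference):
        errors.append(abs(p_start - r_start))  # start-boundary error
        errors.append(abs(p_end - r_end))      # end-boundary error
    return mean(errors), median(errors)


if __name__ == "__main__":
    pred = [(1.0, 2.0), (0.5, 1.5)]
    ref = [(1.1, 2.0), (0.5, 1.2)]
    mae, med = boundary_errors(pred, ref)
    print(f"MAE = {mae:.3f} s, median = {med:.3f} s")
```

A median well below the mean, as in the reported 0.052s versus 0.075s, indicates that a small number of large boundary misses pull the average up.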
