Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

About

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
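To illustrate the kind of cross-attention fusion the abstract describes, here is a minimal NumPy sketch in which frames from one feature branch attend over frames from another. It is not the authors' implementation: the single-head formulation, the random projection matrices, the shapes (`T`, `d_in`, `d`), and the names `mfcc_branch` and `lfcc_branch` are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    query_feats:   (T_q, d_in) frames from one feature branch
    context_feats: (T_c, d_in) frames from another feature branch
    Returns the attended features (T_q, d) and the attention weights (T_q, T_c).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical branch outputs: 50 frames, 32 channels each (assumed sizes).
rng = np.random.default_rng(0)
T, d_in, d = 50, 32, 16
mfcc_branch = rng.standard_normal((T, d_in))
lfcc_branch = rng.standard_normal((T, d_in))
w_q, w_k, w_v = (rng.standard_normal((d_in, d)) / np.sqrt(d_in) for _ in range(3))

# MFCC frames query the LFCC frames; each output row mixes LFCC information.
fused, attn = cross_attention(mfcc_branch, lfcc_branch, w_q, w_k, w_v)
```

Each row of `attn` sums to one, so every fused frame is a convex combination of the projected context frames. A trained model would learn the projections rather than sample them randomly.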

S. Sutharya, Remya K. Sasi • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Audio Deepfake Detection	In-the-Wild	EER	45.9	76
Audio Deepfake Detection	FoR	EER	10.34	28
Audio Deepfake Detection	WaveFake	Accuracy	17.3	15
Audio Deepfake Detection	ASVspoof 2019	Accuracy	84.68	12
Audio Deepfake Detection	MLADDC T2 (test)	Accuracy	96.76	6
Temporal boundary localisation	MLADDC T3	MAE (s)	0.068	3
Temporal Localisation	MLADDC T2+T3 (test)	Temporal MAE (overall, s)	0.075	1
Three-Class Detection	MLADDC T2+T3 (test)	Overall Accuracy	92.71	1
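
The table reports Equal Error Rate (EER) and boundary Mean Absolute Error (MAE). The sketch below shows one common way to compute each with NumPy. It is an assumption-laden illustration, not the evaluation code behind these numbers: the function names, the convention that higher scores mean "more likely fake", the threshold sweep, and the averaging of MAE over both start and end boundaries are all choices made here.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER via a threshold sweep (illustrative).

    scores: higher means more likely fake (assumed convention)
    labels: 1 = fake, 0 = real
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)  # sorted unique score values
    # Add one threshold above the maximum so "flag nothing" is also considered.
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])
    real = scores[labels == 0]
    fake = scores[labels == 1]
    # False-alarm rate: real clips flagged as fake (score >= threshold).
    far = np.array([(real >= t).mean() for t in thresholds])
    # Miss rate: fake clips that fall below the threshold.
    frr = np.array([(fake < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

def boundary_mae(pred, true):
    """Mean absolute error in seconds over all (start, end) boundaries."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.abs(pred - true).mean()

# Toy data, invented purely for demonstration.
eer_separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
eer_overlap = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
mae = boundary_mae([[1.0, 2.0], [0.5, 1.5]], [[1.1, 2.0], [0.5, 1.4]])
```

With a finite score set the two error curves rarely cross exactly, so the sketch takes the threshold where they are closest and averages them. Interpolating between thresholds is another common choice.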
