
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

About

Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact models understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling ~100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty that exposes calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that the SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
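The abstract's test-time augmentation (TTA) protocol can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's exact method: it perturbs a waveform K times with additive noise, scores each copy with a detector, and treats the spread of scores as a proxy for perturbation-based aleatoric uncertainty. The `detector` callable and the noise parameters are hypothetical stand-ins for the paper's SSL-backbone classifier and its perturbation set.

```python
import numpy as np

def tta_uncertainty(waveform, detector, k=8, noise_std=0.005, seed=0):
    """Score K noise-perturbed copies of `waveform` with `detector`.

    Returns (mean_score, score_variance); the variance under perturbation
    serves as a simple aleatoric-uncertainty proxy. All parameters here
    are illustrative, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(k):
        perturbed = waveform + rng.normal(0.0, noise_std, size=waveform.shape)
        scores.append(detector(perturbed))
    scores = np.asarray(scores)
    return float(scores.mean()), float(scores.var())

# Toy stand-in detector: squashes signal energy into a [0, 1] "spoof score".
toy_detector = lambda x: float(1.0 / (1.0 + np.exp(-100.0 * np.mean(x ** 2))))

# 1 second of toy "audio" at 16 kHz.
wave = np.sin(np.linspace(0.0, 2.0 * np.pi, 16000))
mean_score, score_var = tta_uncertainty(wave, toy_detector, k=8)
```

A well-calibrated detector should keep `mean_score` stable and `score_var` small under such perturbations; the paper's finding is that WavLM variants drift toward overconfident scores in this regime while iterative mHuBERT does not.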

Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss • 2026

Related benchmarks

Task                      Dataset            Metric  Result  Rank
Audio Deepfake Detection  ASVspoof DF 2021   EER     1.83    47
Audio Deepfake Detection  ASVspoof LA 2021   EER     7.02    41
Audio Deepfake Detection  ASVspoof 2019      EER     0.49    37
Audio Deepfake Detection  FoR                EER     2.92    27
Audio Deepfake Detection  ADD Track 1 2022   EER     22.06   19
Audio Deepfake Detection  ADD Track 3 2022   EER     3.56    19
Audio Deepfake Detection  ADD 2023 R2        EER     16.1    19
Audio Deepfake Detection  CodecFake          EER     13.34   19
Audio Deepfake Detection  ADD 2023 R1        EER     11.47   19
Audio Deepfake Detection  SONAR              EER     2.15    19
Showing 10 of 16 rows
