Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
About
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling ~100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iteratively pre-trained mHuBERT remains stable. These findings indicate that the SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
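The perturbation-based uncertainty protocol can be sketched in a few lines: score several perturbed copies of each utterance, then treat the mean score as the prediction and the spread as an aleatoric-uncertainty proxy. This is a minimal, hedged illustration, not the paper's exact protocol; `detector_score` is a hypothetical stand-in for a trained detector, and additive Gaussian noise stands in for whatever perturbation family the study uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def detector_score(wav: np.ndarray) -> float:
    # Hypothetical placeholder for a trained detector's spoof
    # probability in [0, 1]; a real system would run the SSL
    # backbone + fusion head here.
    return float(1.0 / (1.0 + np.exp(-10.0 * wav.mean())))

def tta_uncertainty(wav: np.ndarray, n_aug: int = 8, snr_db: float = 30.0):
    """Score n_aug noise-perturbed copies of `wav`.

    Returns (mean score, std of scores); the std serves as a
    simple perturbation-based aleatoric-uncertainty estimate.
    """
    rms = np.sqrt(np.mean(wav ** 2))
    noise_scale = rms * 10 ** (-snr_db / 20.0)  # noise at `snr_db` below signal
    scores = np.array([
        detector_score(wav + rng.normal(0.0, noise_scale, size=wav.shape))
        for _ in range(n_aug)
    ])
    return scores.mean(), scores.std()

wav = rng.normal(0.0, 0.1, 16000)  # 1 s of placeholder 16 kHz audio
mean_score, uncertainty = tta_uncertainty(wav)
```

A well-calibrated detector should keep `uncertainty` small and `mean_score` stable under such perturbations; the overconfident-miscalibration finding corresponds to scores that stay extreme while shifting class under noise.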
Related benchmarks
| Task | Dataset | EER (%) | Rank |
|---|---|---|---|
| Audio Deepfake Detection | ASVspoof DF 2021 | 1.83 | 47 |
| Audio Deepfake Detection | ASVspoof LA 2021 | 7.02 | 41 |
| Audio Deepfake Detection | ASVspoof 2019 | 0.49 | 37 |
| Audio Deepfake Detection | FoR | 2.92 | 27 |
| Audio Deepfake Detection | ADD Track 1 2022 | 22.06 | 19 |
| Audio Deepfake Detection | ADD Track 3 2022 | 3.56 | 19 |
| Audio Deepfake Detection | ADD 2023 R2 | 16.1 | 19 |
| Audio Deepfake Detection | CodecFake | 13.34 | 19 |
| Audio Deepfake Detection | ADD 2023 R1 | 11.47 | 19 |
| Audio Deepfake Detection | SONAR | 2.15 | 19 |