Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
About
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling ~100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iteratively pre-trained mHuBERT remains stable. These findings indicate that the SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
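The perturbation-based uncertainty protocol can be sketched in a few lines: score several perturbed copies of each utterance, then treat the mean score as the prediction and the spread as an aleatoric-uncertainty proxy. This is a minimal, hedged illustration, not the paper's exact protocol; `detector_score` is a hypothetical stand-in for a trained detector, and additive Gaussian noise stands in for whatever perturbation family the study uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def detector_score(wav: np.ndarray) -> float:
    # Hypothetical placeholder for a trained detector's spoof
    # probability in [0, 1]; a real system would run the SSL
    # backbone + fusion head here.
    return float(1.0 / (1.0 + np.exp(-10.0 * wav.mean())))

def tta_uncertainty(wav: np.ndarray, n_aug: int = 8, snr_db: float = 30.0):
    """Score n_aug noise-perturbed copies of `wav`.

    Returns (mean score, std of scores); the std serves as a
    simple perturbation-based aleatoric-uncertainty estimate.
    """
    rms = np.sqrt(np.mean(wav ** 2))
    noise_scale = rms * 10 ** (-snr_db / 20.0)  # noise at `snr_db` below signal
    scores = np.array([
        detector_score(wav + rng.normal(0.0, noise_scale, size=wav.shape))
        for _ in range(n_aug)
    ])
    return scores.mean(), scores.std()

wav = rng.normal(0.0, 0.1, 16000)  # 1 s of placeholder 16 kHz audio
mean_score, uncertainty = tta_uncertainty(wav)
```

A well-calibrated detector should keep `uncertainty` small and `mean_score` stable under such perturbations; the overconfident-miscalibration finding corresponds to scores that stay extreme while shifting class under noise.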
Related benchmarks
| Task | Dataset | EER (%) | Rank |
|---|---|---|---|
| Audio Deepfake Detection | ASVspoof DF 2021 | 1.83 | 47 |
| Audio Deepfake Detection | ASVspoof LA 2021 | 7.02 | 41 |
| Audio Deepfake Detection | ASVspoof 2019 | 0.49 | 37 |
| Audio Deepfake Detection | FoR | 2.92 | 27 |
| Audio Deepfake Detection | ADD Track 1 2022 | 22.06 | 19 |
| Audio Deepfake Detection | ADD Track 3 2022 | 3.56 | 19 |
| Audio Deepfake Detection | ADD 2023 R2 | 16.1 | 19 |
| Audio Deepfake Detection | CodecFake | 13.34 | 19 |
| Audio Deepfake Detection | ADD 2023 R1 | 11.47 | 19 |
| Audio Deepfake Detection | SONAR | 2.15 | 19 |