Alethia: A Foundational Encoder for Voice Deepfakes

About

Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on $5$ different tasks with $56$ benchmark datasets, and find that Alethia significantly outperforms state-of-the-art SFMs, with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitations of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.

Yi Zhu, Brahmi Dwivedi, Jayaram Raghuram, Surya Koppisetti • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Singing Voice Deepfake Detection | CtrSVDD | EER | 10.8 | 16 |
| Partially Fake Speech Localization | Half-Truth (HT) | EER | 9.2 | 8 |
| Partially Fake Speech Localization | LlamaPartialSpoof (LPS) | EER | 19.8 | 8 |
| Partially Fake Speech Localization | PartialSpoof (PS) | EER | 27.1 | 8 |
| Speech Deepfake Detection | SDD-Eval-50 All | EER | 5.2 | 6 |
| Speech Deepfake Detection | SDD-Eval-50 Challenging | EER | 11.5 | 6 |
| Audio-Visual Deepfake Detection | PolyGlotFake | EER | 7.1 | 4 |
| Audio-Visual Deepfake Detection | PolyGlotFake zero-shot | EER (zero-shot) | 7.1 | 4 |
| Source Tracing | ASVspoof5-ST (test) | Silhouette Score | 0.02 | 4 |
| Audio-Visual Deepfake Detection | FakeAVCeleb | EER | 6.3 | 4 |
Showing 10 of 11 rows
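
Most rows above report the equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to compute it from detection scores. It is not the authors' evaluation code. The function name `compute_eer`, the label convention (1 = bona fide, 0 = fake, higher score = more likely bona fide), and the threshold sweep over observed scores are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(scores, labels):
    """Return the equal error rate as a fraction in [0, 1].

    Assumed convention (hypothetical, not taken from the paper):
    labels are 1 for bona fide and 0 for fake, and a higher score
    means the sample is more likely bona fide.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present to compute EER")

    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.unique(scores)
    # False rejection rate: bona fide samples scored below the threshold.
    frr = np.array([(pos < t).mean() for t in thresholds])
    # False acceptance rate: fake samples scored at or above the threshold.
    far = np.array([(neg >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)


if __name__ == "__main__":
    example_scores = [0.1, 0.4, 0.35, 0.8]
    example_labels = [0, 0, 1, 1]
    print(compute_eer(example_scores, example_labels))
```

The function returns a fraction. The table values are presumably percentages, in which case the returned value would be multiplied by 100 for comparison. Evaluation toolkits often interpolate between thresholds instead of picking the nearest crossing, so results may differ slightly on small test sets.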
