Alethia: A Foundational Encoder for Voice Deepfakes

About

Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching-based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for a range of voice deepfake detection and localization tasks. We evaluate on 5 different tasks with 56 benchmark datasets and find that Alethia significantly outperforms state-of-the-art SFMs, with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitations of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.
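The abstract does not spell out the flow-matching objective, so the following is only a minimal sketch of a generic conditional flow-matching loss for spectrogram reconstruction, not the authors' formulation. It assumes a straight-line path between Gaussian noise and the target spectrogram; `flow_matching_loss` and `predict_velocity` are hypothetical names introduced for illustration.

```python
import numpy as np

def flow_matching_loss(predict_velocity, x1, rng):
    """Generic conditional flow-matching loss on a batch of spectrograms x1.

    Assumptions (not taken from the paper): a straight-line path
    x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1, whose target
    velocity is x1 - x0. `predict_velocity(x_t, t)` is a hypothetical
    model callable returning an array shaped like x_t.
    """
    x0 = rng.standard_normal(x1.shape)                 # noise endpoint
    t_shape = (x1.shape[0],) + (1,) * (x1.ndim - 1)    # one time per example
    t = rng.uniform(size=t_shape)                      # t in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1                      # point on the path
    target = x1 - x0                                   # velocity along the path
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))        # mean squared error

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy batch: 4 "spectrograms" with 8 frequency bins and 16 frames.
    spectrograms = rng.standard_normal((4, 8, 16))
    # Placeholder model that always predicts zero velocity.
    zero_model = lambda x_t, t: np.zeros_like(x_t)
    print(flow_matching_loss(zero_model, spectrograms, rng))
```

In an actual pretraining setup the velocity predictor would presumably be conditioned on encoder outputs; that conditioning is omitted here because the abstract gives no details about it.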

Yi Zhu, Brahmi Dwivedi, Jayaram Raghuram, Surya Koppisetti • 2026

Related benchmarks

Task	Dataset	Result
Singing Voice Deepfake Detection	CtrSVDD	EER 10.8	16
Partially Fake Speech Localization	Half-Truth (HT)	EER 9.2	8
Partially Fake Speech Localization	LlamaPartialSpoof (LPS)	EER 19.8	8
Partially Fake Speech Localization	PartialSpoof (PS)	EER 27.1	8
Speech Deepfake Detection	SDD-Eval-50 All	EER 5.2	6
Speech Deepfake Detection	SDD-Eval-50 Challenging	EER 11.5	6
Audio-Visual Deepfake Detection	PolyGlotFake	EER 7.1	4
Audio-Visual Deepfake Detection	PolyGlotFake zero-shot	EER (zero-shot) 7.1	4
Source Tracing	ASVspoof5-ST (test)	Silhouette Score 0.02	4
Audio-Visual Deepfake Detection	FakeAVCeleb	EER 6.3	4
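
Most results above are reported as equal error rate (EER), the operating point at which the false-acceptance and false-rejection rates coincide. The sketch below shows one common way to estimate it from detector scores. It is an illustration only, not the evaluation code behind these numbers: the function name, the score convention (higher means more likely fake), and the label convention (1 = fake, 0 = real) are all assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate EER from detection scores.

    Assumed conventions: higher score = more likely fake;
    labels are 1 for fake and 0 for real. Returns a fraction in [0, 1].
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fake = scores[labels == 1]
    real = scores[labels == 0]
    # Candidate thresholds: every observed score, plus +inf (flag nothing).
    thresholds = np.concatenate([np.unique(scores), [np.inf]])
    # Rate of real samples flagged as fake at each threshold.
    far = np.array([(real >= th).mean() for th in thresholds])
    # Rate of fake samples missed at each threshold.
    frr = np.array([(fake < th).mean() for th in thresholds])
    # Pick the threshold where the two error rates are closest.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)

if __name__ == "__main__":
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    toy_labels = [0, 0, 1, 1]
    print(equal_error_rate(toy_scores, toy_labels))
```

With a finite score set the two error curves rarely cross exactly, so this version averages them at the closest threshold; interpolating along the ROC curve is a common alternative.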

