A Data-Centric Approach to Generalizable Speech Deepfake Detection

About

Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.
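The abstract contrasts two ways of mixing heterogeneous datasets: pruning (DOSS-Select) and re-weighting (DOSS-Weight). It does not say how DOSS scores diversity, so the sketch below is not the authors' method. It only illustrates the general difference between the two, using a hypothetical size-based rule in which each dataset is sampled in proportion to `hours ** alpha`. The function names, the `alpha` parameter, and the example dataset sizes are all assumptions made for illustration.

```python
def reweight(hours, alpha):
    """Return a sampling probability per dataset, proportional to hours ** alpha.

    Hypothetical rule for illustration only (not DOSS):
    alpha = 1.0 reproduces naive aggregation (sampling proportional to size),
    alpha = 0.0 samples every dataset equally often.
    """
    scaled = {name: h ** alpha for name, h in hours.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


def select(hours, keep_fraction, alpha):
    """Return the hours to keep per dataset under a fixed total budget.

    The budget is keep_fraction of all available hours, split according to
    the probabilities from reweight() and capped at each dataset's size.
    """
    budget = keep_fraction * sum(hours.values())
    probs = reweight(hours, alpha)
    return {name: min(hours[name], budget * probs[name]) for name in hours}


# Hypothetical pool: one small and one large dataset, sizes in hours.
pool = {"small_set": 100.0, "large_set": 900.0}
naive_probs = reweight(pool, alpha=1.0)      # dominated by the large dataset
balanced_probs = reweight(pool, alpha=0.0)   # both datasets equally likely
pruned = select(pool, keep_fraction=0.03, alpha=0.0)  # keep 3% of the pool
```

Re-weighting keeps all of the data and changes how often each source is drawn, while pruning discards data up front to meet a budget. In this toy rule both are driven by the same probabilities.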

Wen Huang, Yuchen Mao, Yanmin Qian • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Audio Deepfake Detection	in the wild	EER	0.8	76
Speech Deepfake Detection	FakeOrReal	EER	13	30
Audio Deepfake Detection	ITW	ACC	98.82	15
Speech Deepfake Detection	EF	EER	10	7
Speech Deepfake Detection	ADD ASVspoof 2022	EER	0.82	7
Speech Deepfake Detection	ADD ASVspoof 2023	EER	2.25	7
Speech Deepfake Detection	DV	EER	0.86	7
Speech Deepfake Detection	FSW	EER (%)	6.47	7
Speech Deepfake Detection	ODSS	EER (%)	1.23	7
Speech Deepfake Detection	FoR	Accuracy	99.78	6

Showing 10 of 16 rows
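Most rows above report the equal error rate (EER): the operating point at which the false acceptance rate (spoofed speech accepted as genuine) equals the false rejection rate (genuine speech rejected). The following is a minimal sketch of a threshold-sweep estimate. It assumes that a higher score means "more likely bona fide", and the function name and example scores are hypothetical. Evaluation toolkits often interpolate between thresholds, so values can differ slightly from this estimate.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER as a fraction in [0, 1] by sweeping observed scores.

    Assumption: higher score = more likely bona fide. A trial is accepted
    when its score is at or above the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(frr - far)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2
    return eer


# Hypothetical scores: perfectly separated, then fully overlapping.
separated = compute_eer([0.9, 0.8], [0.1, 0.2])
overlapping = compute_eer([0.1, 0.9], [0.1, 0.9])
```

Multiply the returned fraction by 100 to compare it with rows that list EER as a percentage.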
