A Data-Centric Approach to Generalizable Speech Deepfake Detection
About
Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.
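
The abstract names two ways of mixing heterogeneous datasets, pruning and re-weighting, without giving their details. The sketch below is only a generic illustration of that distinction and is not the DOSS algorithm: the function names `select_top_k` and `sample_sources`, the dataset names, and the per-dataset scores are all hypothetical, and how DOSS actually derives its scores is not described here.

```python
import random

def select_top_k(pool_scores, k):
    """Pruning: keep only the k highest-scoring datasets (hypothetical scores)."""
    return sorted(pool_scores, key=pool_scores.get, reverse=True)[:k]

def sample_sources(pool_scores, n, seed=0):
    """Re-weighting: keep every dataset, but draw training examples from
    each one with probability proportional to its score."""
    rng = random.Random(seed)
    names = list(pool_scores)
    weights = [pool_scores[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# Hypothetical diversity scores for three made-up datasets.
scores = {"dataset_a": 0.5, "dataset_b": 0.1, "dataset_c": 0.4}

kept = select_top_k(scores, k=2)         # pruning drops dataset_b entirely
draws = sample_sources(scores, n=1000)   # re-weighting only makes it rarer
```

Pruning reduces the amount of data that has to be stored and processed, while re-weighting keeps the full pool and changes only how often each source is seen during training.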
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Deepfake Detection | in the wild | EER | 0.8 | 58 |
| Audio Deepfake Detection | ITW | ACC | 98.82 | 15 |
| Speech Deepfake Detection | FakeOrReal | EER | 13 | 9 |
| Speech Deepfake Detection | EF | EER | 10 | 7 |
| Speech Deepfake Detection | ADD ASVspoof 2022 | EER | 0.82 | 7 |
| Speech Deepfake Detection | ADD ASVspoof 2023 | EER | 2.25 | 7 |
| Speech Deepfake Detection | DV | EER | 0.86 | 7 |
| Speech Deepfake Detection | FSW | EER (%) | 6.47 | 7 |
| Speech Deepfake Detection | ODSS | EER (%) | 1.23 | 7 |
| Speech Deepfake Detection | FoR | Accuracy | 99.78 | 6 |
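
Most results above are reported as equal error rate (EER), the operating point at which the false-acceptance and false-rejection rates coincide. As a minimal sketch of how such a number can be computed from detector scores, assuming that higher scores mean "more likely fake" and that labels use 1 for fake and 0 for genuine (both conventions are assumptions, and `compute_eer` is a hypothetical helper, not the evaluation code behind this table):

```python
import numpy as np

def compute_eer(scores, labels):
    """Return the equal error rate as a fraction in [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fake = scores[labels == 1]
    real = scores[labels == 0]

    # Candidate thresholds: every observed score, plus +inf so that
    # "flag nothing as fake" is also considered.
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # A sample is flagged as fake when its score is >= the threshold.
    miss_rate = np.array([(fake < t).mean() for t in thresholds])
    false_alarm_rate = np.array([(real >= t).mean() for t in thresholds])

    # With finitely many samples the two curves rarely cross exactly,
    # so take the threshold where they are closest and average them.
    idx = np.argmin(np.abs(miss_rate - false_alarm_rate))
    return (miss_rate[idx] + false_alarm_rate[idx]) / 2

eer_separable = compute_eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
eer_overlap = compute_eer([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```

The function returns a fraction; multiply by 100 to compare against values reported in percent.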