Forensic Similarity for Speech Deepfakes
About
In this paper, we introduce the concept of forensic similarity to the speech deepfake detection domain: determining whether two audio segments share the same underlying forensic traces. Our approach is inspired by prior work in the image domain. To transfer this idea to audio, we propose a two-stage deep learning framework consisting of a Siamese feature extractor and a core decision module, referred to as the similarity network. The system aims to assess whether two speech samples originate from the same source by comparing their forensic characteristics. In practice, the model maps a pair of audio segments to a similarity score indicating whether they contain the same or different forensic traces. We evaluate the proposed method on the emerging task of source verification, demonstrating its ability to determine whether two speech samples were generated by the same model. In addition, we explore its applicability to audio splicing detection as a complementary use case. Experimental results show that the proposed approach generalizes well to previously unseen forensic traces, highlighting its robustness, flexibility, and practical relevance for digital audio forensics.
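The two-stage structure described above can be illustrated with a small, untrained toy sketch. Everything in it is an assumption made for illustration only: the class name `ForensicSimilaritySketch`, the layer sizes, the dense layers standing in for the feature extractor, and the use of plain feature vectors in place of real audio segments. It is not the architecture used in the paper; it only shows the idea of one shared extractor applied to both inputs, followed by a similarity network that outputs a score in (0, 1).

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Create random weights for a toy dense layer (untrained, illustrative)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(layer, x):
    """Apply a dense layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ForensicSimilaritySketch:
    """Hypothetical two-stage model: shared extractor + similarity network."""

    def __init__(self, n_in=16, n_feat=8, n_hidden=8, seed=0):
        rng = random.Random(seed)
        # Stage 1: a single extractor whose weights are shared by both branches.
        self.extractor = make_linear(n_in, n_feat, rng)
        # Stage 2: similarity network operating on the concatenated features.
        self.hidden = make_linear(2 * n_feat, n_hidden, rng)
        self.out = make_linear(n_hidden, 1, rng)

    def extract(self, segment):
        """Map one segment (here: a plain feature vector) to an embedding."""
        return relu(linear(self.extractor, segment))

    def similarity(self, seg_a, seg_b):
        """Return a score in (0, 1); higher means 'same forensic traces'."""
        feat_a = self.extract(seg_a)  # same weights for both inputs
        feat_b = self.extract(seg_b)
        h = relu(linear(self.hidden, feat_a + feat_b))  # list concatenation
        return sigmoid(linear(self.out, h)[0])


# Usage with random stand-in inputs (not real audio features).
model = ForensicSimilaritySketch()
rng = random.Random(1)
seg_a = [rng.gauss(0.0, 1.0) for _ in range(16)]
seg_b = [rng.gauss(0.0, 1.0) for _ in range(16)]
score = model.similarity(seg_a, seg_b)
print(f"similarity score: {score:.3f}")
```

Because the weights are random, the printed score carries no meaning. In a trained system, the score would be thresholded to decide whether the two segments share the same forensic traces. Note that this sketch concatenates the two embeddings, so the score can depend on input order; whether the paper's similarity network is symmetric is not stated here.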
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Source verification | MLAAD open-set in-domain (test) | EER 10.5 | 4 |
| Source verification | TIMIT-TTS out-of-domain (test) | EER 31.1 | 4 |
| Source verification | ASVspoof out-of-domain 2019 (test) | EER 25.6 | 4 |
| Source verification | Average aggregated (test) | EER 22.4 | 4 |
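
The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration, EER could be estimated from pairwise similarity scores as sketched below. The function name `equal_error_rate`, the label convention (1 for same-source pairs, 0 for different-source pairs), and the simple threshold sweep are assumptions for this sketch; the evaluation protocol actually used for the table may differ, for example in how it interpolates between thresholds.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping thresholds over the observed scores.

    scores: similarity scores, higher means 'more likely same source'.
    labels: 1 for same-source pairs, 0 for different-source pairs.
    Returns the EER as a fraction in [0, 1].
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pair of each class")

    best_gap, eer = float("inf"), None
    # Include +inf so the 'reject everything' operating point is covered.
    for threshold in sorted(set(scores)) + [float("inf")]:
        # False acceptance: different-source pair scored at or above threshold.
        far = sum(s >= threshold for s in neg) / len(neg)
        # False rejection: same-source pair scored below threshold.
        frr = sum(s < threshold for s in pos) / len(pos)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration, not taken from the paper).
perfect = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(perfect, overlapping)
```

Multiplying the returned fraction by 100 gives a percentage, which is the usual way EER is reported.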