BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention
About
We propose BiCrossMamba-ST, a robust framework for speech deepfake detection built on a dual-branch spectro-temporal architecture that combines bidirectional Mamba blocks with mutual cross-attention. By processing spectral sub-bands and temporal intervals in separate branches and then integrating their representations, BiCrossMamba-ST captures the subtle cues of synthetic speech. The framework also uses a convolution-based 2D attention map to focus on specific spectro-temporal regions. Operating directly on raw features, BiCrossMamba-ST achieves significant performance improvements: relative gains of 67.74% and 26.3% over the state-of-the-art AASIST on the ASVspoof LA21 and ASVspoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVspoof DF21. Code and models will be made publicly available.
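The relative gains quoted above can be read as a percentage reduction in an error metric against a baseline. The sketch below assumes the metric is equal error rate (EER), which the abstract does not state explicitly. The function name `relative_gain` and the input values are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_gain(baseline_eer: float, new_eer: float) -> float:
    """Relative error reduction in percent; positive means the new system is better.

    Assumption: "relative gain" means (baseline - new) / baseline on an
    error metric such as EER. The abstract does not spell this out.
    """
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - new_eer) / baseline_eer * 100.0


# Hypothetical values for illustration only, not figures from the paper.
example = relative_gain(baseline_eer=10.0, new_eer=4.0)
print(f"{example:.2f}% relative gain")
```

Under this reading, a relative gain says how much of the baseline's error was removed, so the same absolute EER drop counts for more against a stronger baseline.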
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Deepfake Detection | In-the-Wild | EER | 7.94 | 58 |
| Audio Deepfake Detection | ASVspoof DF 2021 | EER | 2.35 | 35 |
| Audio Deepfake Detection | ASVspoof LA 2021 | EER | 3.39 | 23 |
| Audio Deepfake Detection | ASVspoof LA 2021 | EER | 3.83 | 12 |
| Audio Deepfake Detection | ASVspoof LA 2019 | EER | 71 | 11 |
| Audio Deepfake Detection | ASVspoof 5 | EER | 13.67 | 9 |
| Audio Deepfake Detection | ADD Track 1 2022 | F1 Score | 56.7 | 7 |
| Audio Deepfake Detection | ADD Track 1 2022 | EER | 30.44 | 7 |
| Audio Deepfake Detection | ASVspoof 2024 | F1 Score | 72 | 7 |
| Audio Deepfake Detection | LibriVoc | F1 Score | 92.9 | 7 |
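Most rows above report equal error rate (EER), the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of that definition in plain Python, assuming higher scores mean "more likely bona fide". The function name `equal_error_rate` and the toy scores are hypothetical, and official evaluation toolkits may interpolate between thresholds rather than pick the closest one, so this is not the scoring script behind the table.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER in percent by sweeping every observed score as a threshold.

    Assumption: a trial is accepted as bona fide when score >= threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap = float("inf")
    eer = 0.0
    for t in thresholds:
        # False acceptance: spoofed trials scoring at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scoring below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer * 100.0


# Toy scores for illustration only, not data from any benchmark.
toy_bonafide = [0.9, 0.8, 0.7, 0.4]
toy_spoof = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(toy_bonafide, toy_spoof))
```

Lower EER is better, whereas the F1 Score rows are higher-is-better, so the two kinds of rows are not directly comparable.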