BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

About

We propose BiCrossMamba-ST, a robust framework for speech deepfake detection built on a dual-branch spectro-temporal architecture with bidirectional Mamba blocks and mutual cross-attention. By processing spectral sub-bands and temporal intervals separately and then integrating their representations, BiCrossMamba-ST captures the subtle cues of synthetic speech. The framework also uses a convolution-based 2D attention map to focus on specific spectro-temporal regions. Operating directly on raw features, BiCrossMamba-ST achieves relative gains of 67.74% and 26.3% over the state-of-the-art AASIST on the ASVspoof LA21 and ASVspoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVspoof DF21. Code and models will be made publicly available.
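The abstract reports relative gains without saying how they are computed. A minimal sketch follows, assuming the gains are relative reductions in equal error rate (EER), the metric used in the benchmark table on this page. The function name `relative_gain` and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_gain(baseline_eer: float, new_eer: float) -> float:
    """Relative EER reduction, in percent, of a new system over a baseline.

    Assumption: lower EER is better, so a positive return value means
    the new system improves on the baseline.
    """
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical numbers for illustration only (not results from the paper):
# a baseline at 10.0% EER and a new system at 7.0% EER.
example_gain = relative_gain(10.0, 7.0)
print(f"relative gain: {example_gain:.2f}%")
```

Under this definition, a relative gain describes the fraction of the baseline error that was removed, not the absolute difference in EER points.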

Yassine El Kheir, Tim Polzehl, Sebastian Möller • 2025

Related benchmarks

Task	Dataset	Result
Audio Deepfake Detection	ASVspoof DF 2021	EER 2.35	87
Audio Deepfake Detection	In-the-Wild	EER 7.94	76
Audio Deepfake Detection	ASVspoof LA 2021	EER 3.39	53
Audio Deepfake Detection	CodecFake	EER 37.7	50
Audio Deepfake Detection	ASVspoof LA 2019	EER 71	38
Audio Deepfake Detection	FoR	EER 6.85	28
Audio Deepfake Detection	ADD Track 1 2022	EER 30.44	19
Audio Deepfake Detection	SONAR	EER 27.36	19
Audio Deepfake Detection	ADD Track 3 2022	EER 18.69	19
Audio Deepfake Detection	ADD 2023 R1	EER 29.44	19
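The results above are reported as EER (equal error rate), the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how EER can be estimated from detector scores, assuming that a higher score means "more likely bona fide". The function name `equal_error_rate` and the toy scores are hypothetical. Official benchmark tooling may interpolate between thresholds, so its values can differ slightly.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    Assumption: a trial is accepted as bona fide when score >= threshold.
    Returns (eer, threshold), with eer as a fraction in [0, 1].
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores for illustration only (not data from any benchmark).
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.6, 0.2, 0.1, 0.4]
eer, threshold = equal_error_rate(bonafide, spoof)
print(f"EER = {100 * eer:.2f}% at threshold {threshold}")
```

With these toy scores, one bona fide trial out of four is rejected and one spoofed trial out of four is accepted at the chosen threshold, so both error rates meet at 25%.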

