Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

About

Recent synthetic speech detectors built on the Transformer model outperform their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer, which learns the temporal relationships among input tokens. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, and MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that, with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the largest improvement in detecting synthetic speech.
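
To make the abstract's point about MHSA concrete, here is a minimal NumPy sketch of standard multi-head self-attention over a sequence of T frame-level tokens. It is not the TCM module, whose internals are not described on this page; the function name, shapes, and random weights are illustrative assumptions. The sketch shows that each head produces a T x T weight matrix: one scalar per pair of time steps, applied uniformly to every channel in that head, so the weighting itself carries no per-channel information.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Standard MHSA (hypothetical helper, not the paper's code).

    x: (T, D) sequence of T tokens with D channels.
    Returns the output (T, D) and the attention weights (H, T, T).
    """
    T, D = x.shape
    d_head = D // num_heads

    def split(m):
        # (T, D) -> (H, T, d_head): each head sees a slice of the channels.
        return m.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # One scalar score per (head, query time step, key time step).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    # The same temporal weights are applied to all d_head channels of a head.
    heads = attn @ v
    out = heads.transpose(1, 0, 2).reshape(T, D) @ w_o
    return out, attn

# Toy input: 6 time steps, 8 channels, 2 heads (arbitrary illustrative sizes).
rng = np.random.default_rng(0)
T, D, H = 6, 8, 2
x = rng.standard_normal((T, D))
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, H)
```

Here `attn` has shape (H, T, T) and contains no channel axis, which is the limitation the abstract says the TCM module is meant to address.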

Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng • 2024

Related benchmarks

Task	Dataset	Metric	Value	(unlabeled)
Audio Deepfake Detection	in the wild	EER	7.79	65
Audio Deepfake Detection	CodecFake	EER	13.7	50
Audio Deepfake Detection	ASVspoof DF 2021	EER	2.06	47
Audio Deepfake Detection	ASVspoof LA 2021	EER	1.03	41
Audio Deepfake Detection	ASVspoof LA 2019	EER	19	38
Audio Deepfake Detection	ASVspoof 2019	EER	0.18	37
Spoof Speech Detection	ASVspoof LA 2021 (eval)	min-tDCF	0.213	36
Audio Deepfake Detection	FoR	EER	4.02	28
Synthetic Speech Detection	ASVspoof DF 2021 (eval)	EER (%)	2.06	25
Audio Deepfake Detection	ADD Track 1 2022	EER	23.63	19
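
Most rows above report the equal error rate (EER), the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The following is a minimal pure-Python sketch of one common way to estimate it from detector scores. The function name, the convention that higher scores mean "more likely bona fide", and the toy scores are assumptions for illustration, not the evaluation code behind these results.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    Assumes higher score = more likely bona fide (hypothetical convention).
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: bona fide sample scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed sample scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Toy scores: one bona fide sample (0.4) scores below one spoofed sample (0.6).
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```

With these toy scores, a threshold of 0.6 rejects one of four bona fide samples and accepts one of four spoofed samples, so both error rates are 25%. Real evaluations typically interpolate between thresholds on much larger score sets.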

