Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

About

Recent synthetic speech detectors leveraging the Transformer model show superior performance compared to their convolutional neural network counterparts. This improvement may stem from the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer, which learns the temporal relationships among input tokens. However, artifacts of synthetic speech can be localized in specific regions of both frequency channels and temporal segments, and MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module outperforms the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the largest improvement for detecting synthetic speech.
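The paper does not publish the TCM module's internals here, but the idea of attending over both temporal segments and frequency channels can be illustrated with a minimal, hypothetical sketch: run single-head self-attention once with time frames as tokens and once with channels as tokens, then fuse the two views. This is an assumption-laden toy version, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # single-head scaled dot-product self-attention over the first axis;
    # queries, keys, and values are all x (no learned projections, for brevity)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def temporal_channel_attention(x):
    # x: (T, C) feature map -- T temporal frames, C frequency channels
    temporal = self_attention(x)      # tokens = time frames (standard MHSA view)
    channel = self_attention(x.T).T   # tokens = channels (the view MHSA neglects)
    return temporal + channel         # naive additive fusion of both views

X = np.random.randn(50, 64)           # 50 frames, 64 channels
Y = temporal_channel_attention(X)
print(Y.shape)  # (50, 64)
```

In the real module, the fusion and the per-view projections would be learned; the point of the sketch is only that channel-wise attention reuses the same attention machinery on the transposed feature map.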

Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Deepfake Detection | In the Wild | EER | 7.79 | 58 |
| Spoof Speech Detection | ASVspoof LA 2021 (eval) | min-tDCF | 0.213 | 36 |
| Audio Deepfake Detection | ASVspoof DF 2021 | EER | 2.06 | 35 |
| Audio Deepfake Detection | ASVspoof LA 2021 | EER | 1.03 | 23 |
| Synthetic Speech Detection | ASVspoof DF 2021 (eval) | EER (%) | 2.06 | 19 |
| Speech Spoofing Detection | In-the-Wild (ITW) (eval) | EER | 7.79 | 19 |
| Audio Deepfake Detection | ASVspoof LA and DF 2021 | EER (DF) | 2.06 | 17 |
| Audio Deepfake Detection | ASVspoof LA 2021 | EER | 3 | 12 |
| Audio Deepfake Detection | ASVspoof LA 2019 | EER | 19 | 11 |
| Spoofing Attack Detection | ASVspoof LA 2021 | EER | 1.18 | 9 |

Showing 10 of 30 rows
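Most rows above report EER (Equal Error Rate): the operating point where the false-acceptance rate on spoofed speech equals the false-rejection rate on bona fide speech. As a hedged illustration (synthetic Gaussian scores, not the paper's data), it can be computed by sweeping a threshold over the detection scores:

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    # EER: threshold where the false-acceptance rate (spoof scored at or
    # above threshold) equals the false-rejection rate (bona fide scored
    # below threshold); we take the closest crossing on the score grid.
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

rng = np.random.default_rng(0)
bona = rng.normal(1.0, 1.0, 1000)    # higher score = more bona fide
spoof = rng.normal(-1.0, 1.0, 1000)
print(f"EER: {compute_eer(bona, spoof):.4f}")
```

With the two score distributions one standard deviation on either side of zero, the EER lands near 0.16; a lower EER (as in the table, e.g. 1.03%) means the bona fide and spoof score distributions are far better separated.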

Other info

Code
