
Disentangling Voice and Content with Self-Supervision for Speaker Recognition

About

For speaker recognition, it is difficult to extract an accurate speaker representation from speech because speaker traits are mixed with content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with three Gaussian inference layers, each consisting of a learnable transition model that extracts a distinct speech component. Notably, a strengthened transition model is specifically designed to capture complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of any labels other than speaker identities. The efficacy of the proposed framework is validated via experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor extra data is needed, it is readily applicable in practice.
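The abstract does not spell out the layer equations, but a Gaussian inference layer with a learnable transition model generalizes the classic linear-Gaussian predict/update recursion. The following is a minimal numpy sketch of that generic recursion (a Kalman-style filtering step), not the authors' exact architecture; the matrices A, Q, H, R stand in for what would be learned parameters.

```python
import numpy as np

def gaussian_inference_step(mean, cov, obs, A, Q, H, R):
    """One predict/update step of a linear-Gaussian recursion.

    Generic sketch of the kind of transition-model inference the paper's
    Gaussian inference layers build on -- NOT the authors' exact model.
    A: transition matrix, Q: process noise, H: observation matrix,
    R: observation noise.
    """
    # Predict: propagate the latent state through the transition model.
    mean_pred = A @ mean
    cov_pred = A @ cov @ A.T + Q
    # Update: correct the prediction with the observed speech frame.
    S = H @ cov_pred @ H.T + R              # innovation covariance
    K = cov_pred @ H.T @ np.linalg.inv(S)   # gain
    mean_new = mean_pred + K @ (obs - H @ mean_pred)
    cov_new = (np.eye(len(mean)) - K @ H) @ cov_pred
    return mean_new, cov_new

# Toy 1-D example: prior at 0, observation at 1; the posterior mean
# lands between them and the posterior variance shrinks.
m, P = gaussian_inference_step(
    np.array([0.0]), np.eye(1), np.array([1.0]),
    np.eye(1), 0.1 * np.eye(1), np.eye(1), 0.1 * np.eye(1))
```

In the paper's setting such a transition model would be trained jointly with the speaker network, with one layer per speech component; the fixed matrices here are placeholders for illustration only.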

Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li • 2023

Related benchmarks

Task                  Dataset                      Metric      Result  Rank
--------------------  ---------------------------  ----------  ------  ----
Speaker Verification  VoxCeleb1 (test)             Cosine EER  0.984   80
Speaker Verification  VoxCeleb1 hard (test)        EER         1.857   25
Speaker Verification  VoxCeleb1 extended (test)    EER         1.075   25
Speaker Verification  SITW (eval)                  EER         1.34    12
Speaker Recognition   VoxCeleb1 extended (vox1-e)  EER (mean)  1.075   11
Speaker Recognition   VoxCeleb1 original (vox1-o)  EER (mean)  0.984   11
Speaker Recognition   VoxCeleb Hard 1              EER         0.0186  6
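EER (equal error rate) in the table above is the operating point at which the false-rejection and false-acceptance rates coincide. A minimal sketch of computing it from raw verification scores via a threshold sweep (a generic implementation, not the evaluation toolkit used by the paper):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER from same-speaker (target) and different-speaker scores.

    Sweeps every observed score as a threshold; FRR is the fraction of
    targets scored below the threshold, FAR the fraction of non-targets
    at or above it.  The EER is read off where the two rates cross
    (approximated here by the minimum absolute gap).
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2.0
```

Perfectly separated scores give an EER of 0; the leaderboard values above are percentages over the VoxCeleb/SITW trial lists.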
