
An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

About

In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and learning rate schedules, to stabilize the fine-tuning process and further boost performance. We propose multi-head factorized attentive pooling to factorize the comparison of speaker representations into multiple phonetic clusters. We regularize towards the parameters of the pre-trained model and set different learning rates for each of its layers during fine-tuning. The experimental results show that our method can significantly shorten the training time to 4 hours and achieve SOTA performance: 0.59%, 0.79% and 1.77% EER on Vox1-O, Vox1-E and Vox1-H, respectively.
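To make the pooling idea concrete, here is a minimal pure-Python sketch of multi-head attentive pooling: the frame-level features are split into per-head slices (the "phonetic clusters" of the abstract), each head softmax-weights the frames, and the per-head weighted means are concatenated. The scoring function below is a toy stand-in (a sum over the head's slice); in the actual model it would be a learned attention projection, and the factorization details follow the paper, not this sketch.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def multi_head_attentive_pooling(frames, num_heads):
    """frames: list of T frame-level feature vectors of dim D,
    with D divisible by num_heads. Each head attends over its own
    D/num_heads slice and pools it into a weighted mean; the head
    outputs are concatenated into a single utterance embedding."""
    T = len(frames)
    D = len(frames[0])
    d = D // num_heads
    pooled = []
    for h in range(num_heads):
        lo, hi = h * d, (h + 1) * d
        # toy per-frame score for this head (a learned projection
        # in the real model)
        scores = [sum(f[lo:hi]) for f in frames]
        w = softmax(scores)
        # attention-weighted mean of this head's feature slice
        head_mean = [sum(w[t] * frames[t][j] for t in range(T))
                     for j in range(lo, hi)]
        pooled.extend(head_mean)
    return pooled
```

Because each head normalizes its weights independently, frames can contribute differently to different heads, which is the point of factorizing the comparison across clusters.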

Junyi Peng, Oldřich Plchot, Themos Stafylakis, Ladislav Mošner, Lukáš Burget, Jan Černocký • 2022

Related benchmarks

Task                  Dataset                          Result         Rank
Speaker Verification  VoxCeleb1-O Cleaned (Original)   EER (%) 0.49   53
Speaker Verification  VoxCeleb1 Cleaned (Extended)     EER (%) 0.79   45
Speaker Verification  VoxCeleb1 Hard Cleaned           EER (%) 1.77   45
