Fine-tuning wav2vec2 for speaker recognition
About
This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss. Our best-performing variant, w2v2-aam, achieves a 1.88% EER on the extended VoxCeleb1 test set, compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at https://github.com/nikvaessen/w2v2-speaker.
Nik Vaessen, David A. van Leeuwen • 2021
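The abstract mentions two steps that a small example may make more concrete: pooling a variable-length sequence of frame outputs into one fixed-length embedding, and scoring that embedding with an AAM softmax loss. The following is a minimal NumPy sketch, not the authors' implementation. It assumes mean pooling as the pooling method (the paper compares several), and the function names and the `margin` and `scale` values are hypothetical choices for illustration.

```python
import numpy as np


def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Average a (num_frames, dim) sequence into a single (dim,) embedding."""
    return frames.mean(axis=0)


def aam_softmax_loss(embedding, class_weights, label, margin=0.2, scale=30.0):
    """Additive angular margin softmax loss for one embedding.

    `margin` and `scale` are illustrative defaults, not values from the paper.
    This sketch omits the special handling some implementations apply when
    the target angle plus the margin exceeds pi.
    """
    # Cosine similarity between the unit-length embedding and each class centre.
    emb = embedding / np.linalg.norm(embedding)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(w @ emb, -1.0, 1.0)

    # Add the angular margin to the target class only, then rescale.
    theta = np.arccos(cos)
    logits = scale * cos
    logits[label] = scale * np.cos(theta[label] + margin)

    # Numerically stable cross-entropy over the scaled logits.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]
```

Because the margin lowers only the target-class logit, the loss for a correctly aligned embedding is higher than with plain softmax, which pushes embeddings of the same speaker closer together during training.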
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speaker Verification | VoxCeleb1 hard (test) | -- | 25 |
| Speaker Recognition | VoxCeleb1 original (vox1-o) | EER (mean) 1.91 | 11 |
| Speaker Recognition | VoxCeleb1 extended (vox1-e) | EER (mean) 2.22 | 11 |
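The results above are reported as equal error rate (EER): the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration of how the metric can be computed from trial scores, here is a simple threshold sweep in NumPy. The function name is hypothetical, and evaluation toolkits typically interpolate between thresholds, so this sketch may differ slightly from published numbers.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER from similarity scores and same-speaker labels.

    scores: higher means "more likely the same speaker".
    labels: True/1 for same-speaker trials, False/0 for different-speaker trials.
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    # Use every distinct score as a candidate threshold (accept if score >= t).
    thresholds = np.unique(scores)
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```

A perfectly separated set of trials yields an EER of 0, while scores that carry no information about speaker identity yield an EER near 0.5.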