In defence of metric learning for speaker recognition

About

The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should condense information into a compact utterance-level representation with small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of the most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and that networks trained with our proposed metric learning objective outperform state-of-the-art methods.
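For readers unfamiliar with the vanilla triplet loss mentioned in the abstract, a minimal sketch follows. It assumes L2-normalised embeddings, squared Euclidean distance, and a margin of 0.1; these choices and the function names are illustrative assumptions, not the authors' implementation. The loss is zero once the anchor is closer to the same-speaker embedding than to the different-speaker embedding by at least the margin.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss for one (anchor, positive, negative) triple.

    anchor and positive are embeddings of the same speaker;
    negative is an embedding of a different speaker.
    The margin value is an assumption chosen for illustration.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)
```

A well-separated triple such as `triplet_loss([1, 0], [1, 0], [0, 1])` gives zero loss, while swapping the positive and negative embeddings gives a positive loss that training would push down.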

Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, Icksang Han • 2020

Related benchmarks

Task	Dataset	Result
Speaker Verification	VoxCeleb1 (Vox1-O)	EER 2.08	160
Speaker Recognition	VoxCeleb1 (test)	EER 2.21	126
Speaker Verification	VoxCeleb1 (test)	Cosine EER 2.22	85
Speaker Verification	VoxCeleb1 extended	EER 2.18	21
Speaker Verification	VoxCeleb1 hard (H)	EER 4.19	21
Speaker Identification	VoxCeleb1 (test)	Top-1 Accuracy 77.45	13
Speaker Identification	LibriSpeech (LBS) (test)	Top-1 Accuracy 91.56	13
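Most rows above report the equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate; the values are presumably percentages. The sketch below shows one simple way to approximate it from verification scores by sweeping thresholds. The function name and the threshold sweep are illustrative assumptions, not the evaluation script used for these benchmarks.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER as a fraction in [0, 1].

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the figures in the table. With few trials the two error rates may never be exactly equal, so this sketch returns their average at the closest threshold.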
