ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings

About

In this paper, we present ECAPA2, a novel hybrid neural network architecture and training strategy to produce robust speaker embeddings. Most speaker verification models are based on either the 1D- or 2D-convolutional operation, often manifested as Time Delay Neural Networks or ResNets, respectively. Hybrid models are relatively unexplored, and there is no intuitive explanation of what constitutes best practice in their architectural choices. We motivate the proposed ECAPA2 model in this paper with an analysis of current speaker verification architectures. In addition, we propose a training strategy which makes the speaker embeddings more robust against overlapping speech and short utterance lengths. The presented ECAPA2 architecture and training strategy attain state-of-the-art performance on the VoxCeleb1 test sets with significantly fewer parameters than current models. Finally, we make a pre-trained model publicly available to promote research on downstream tasks.

Jenthe Thienpondt, Kris Demuynck • 2024

Related benchmarks

Task	Dataset	Result
Speaker Verification	VoxCeleb1 (Vox1-O)	EER 0.44	160
Speaker Verification	VoxCeleb1 (Vox1-H)	EER 1.15	103
Speaker Verification	VoxCeleb-E	EER 0.62	95
Speaker Verification	VoxCeleb1-O Cleaned (Original)	EER (%) 0.44	61
Speaker Verification	VoxCeleb1 Cleaned (Extended)	EER (%) 0.62	45
Speaker Verification	VoxCeleb1 Hard Cleaned	EER 0.0115	45
Speaker Verification	VoxCeleb1 hard (H)	EER 1.01	21
Speaker Verification	VoxCeleb1 extended	EER 59	21
Speaker Recognition	SITW (Speakers In The Wild) core-core protocol	EER 3.64	9
Speaker Verification	CHAINS Norm vs Norm	EER 0.21	7

Showing 10 of 18 rows
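
Every row above reports the equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The snippet below is a minimal sketch of how an EER can be computed from verification trial scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the toy scores, and the choice to take the threshold where the two rates are closest (without interpolation or tie handling) are assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Return the EER as a fraction in [0, 1].

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    Illustrative sketch: tied scores are not treated specially.
    """
    # Sort trials from highest to lowest score.
    trials = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_target = sum(1 for _, y in trials if y == 1)
    n_nontarget = len(trials) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), None
    accepted_targets = 0
    accepted_nontargets = 0
    # k is the number of top-scoring trials accepted at the current threshold.
    for k in range(len(trials) + 1):
        if k > 0:
            if trials[k - 1][1] == 1:
                accepted_targets += 1
            else:
                accepted_nontargets += 1
        frr = 1 - accepted_targets / n_target    # targets wrongly rejected
        far = accepted_nontargets / n_nontarget  # non-targets wrongly accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made-up scores, not taken from any benchmark).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"EER: {100 * equal_error_rate(toy_scores, toy_labels):.2f}%")
```

The function returns a fraction; multiplying by 100 gives a percentage, which is the unit most rows in the table appear to use.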
