Xi-Vector Embedding for Speaker Recognition

About

We present a Bayesian formulation for deep speaker embedding, wherein the xi-vector is the Bayesian counterpart of the x-vector, taking into account the uncertainty estimate. On the technology front, we offer a simple and straightforward extension to the now widely used x-vector. It consists of an auxiliary neural net that predicts the frame-wise uncertainty of the input sequence. We show that the proposed extension leads to substantial improvement across all operating points, with a significant reduction in error rates and detection cost. On the theoretical front, our proposal integrates the Bayesian formulation of the linear Gaussian model into speaker-embedding neural networks via the pooling layer. In one sense, our proposal integrates the Bayesian formulation of the i-vector into that of the x-vector. Hence, we refer to the embedding as the xi-vector, which is pronounced as /zai/ vector. Experimental results on the SITW evaluation set show a consistent improvement of over 17.5% in equal-error-rate and 10.9% in minimum detection cost.

Kong Aik Lee, Qiongqiong Wang, Takafumi Koshinaka • 2021
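
The pooling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes diagonal precisions, assumes the auxiliary network outputs log-precisions, and uses random arrays in place of real encoder outputs. The function name `gaussian_posterior_pooling` and all parameter names are hypothetical. The update itself is the standard posterior of a linear Gaussian model, where each frame is weighted by its predicted precision and combined with a Gaussian prior.

```python
import numpy as np


def gaussian_posterior_pooling(z, log_prec, prior_mean, prior_log_prec):
    """Pool frame-wise estimates into one utterance-level Gaussian posterior.

    z:              (T, D) frame-wise point estimates (assumed encoder output)
    log_prec:       (T, D) frame-wise log-precisions (assumed auxiliary-net output)
    prior_mean:     (D,)   mean of the Gaussian prior
    prior_log_prec: (D,)   log-precision of the Gaussian prior
    Returns the posterior mean (D,) and posterior variance (D,).
    """
    frame_prec = np.exp(log_prec)        # exponentiate so precisions stay positive
    prior_prec = np.exp(prior_log_prec)

    # Posterior precision: sum of frame precisions plus the prior precision.
    post_prec = frame_prec.sum(axis=0) + prior_prec

    # Posterior mean: precision-weighted average of the frames and the prior mean.
    post_mean = ((frame_prec * z).sum(axis=0) + prior_prec * prior_mean) / post_prec
    return post_mean, 1.0 / post_prec


# Random stand-ins for the outputs of an encoder and an uncertainty network.
rng = np.random.default_rng(0)
T, D = 200, 4
z = rng.normal(size=(T, D))
log_prec = rng.normal(size=(T, D))

post_mean, post_var = gaussian_posterior_pooling(
    z, log_prec, prior_mean=np.zeros(D), prior_log_prec=np.zeros(D)
)

# With equal frame precisions and a nearly flat prior, the result reduces to
# plain temporal mean pooling.
uniform_mean, _ = gaussian_posterior_pooling(
    z, np.zeros((T, D)), prior_mean=np.zeros(D), prior_log_prec=np.full(D, -50.0)
)
```

In this sketch, frames with low predicted precision (high uncertainty) contribute little to the pooled mean, and the posterior variance shrinks as more confident frames are accumulated.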

Related benchmarks

Task	Dataset	Result
Speaker Verification	VoxCeleb1 (Vox1-O)	EER 36	160
Speaker Verification	VoxCeleb1 (test)	Cosine EER 0.936	85
Speaker Verification	VoxCeleb1 hard (test)	EER 1.942	25
Speaker Verification	VoxCeleb1 extended (test)	EER 1.11	25
Speaker Verification	VoxCeleb1 hard (H)	EER 0.98	21
Speaker Verification	VoxCeleb1 extended	EER 60	21
Speaker Verification	SITW (eval)	EER 1.394	14
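
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be computed from verification scores. It is a simple threshold sweep, not the scoring tool used to produce these benchmark numbers; the function name `equal_error_rate` and the toy scores are hypothetical.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Return the error rate at the threshold where miss and false-alarm rates meet.

    target_scores:    scores of same-speaker trials (higher means more similar)
    nontarget_scores: scores of different-speaker trials
    The result is a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # Miss rate: target trials rejected (score below the threshold).
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: non-target trials accepted (score at or above the threshold).
    false_alarm = np.array([(nontarget_scores >= t).mean() for t in thresholds])

    # Pick the threshold where the two rates are closest and average them.
    idx = np.argmin(np.abs(miss - false_alarm))
    return (miss[idx] + false_alarm[idx]) / 2.0


# Toy scores: perfectly separated trials versus fully overlapping trials.
eer_separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
eer_overlapping = equal_error_rate([0.0, 1.0], [0.0, 1.0])
```

Averaging the two rates at the closest crossing is one common convention; evaluation toolkits may interpolate between thresholds instead, which can change the result slightly on small trial lists.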
