Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing

About

Contrastive predictive coding (CPC) aims to learn representations of speech by distinguishing future observations from a set of negative examples. Previous work has shown that linear classifiers trained on CPC features can accurately predict speaker and phone labels. However, it is unclear how the features actually capture speaker and phonetic information, and whether it is possible to normalize out the irrelevant details (depending on the downstream task). In this paper, we first show that the per-utterance mean of CPC features captures speaker information to a large extent. Concretely, we find that comparing means performs well on a speaker verification task. Next, probing experiments show that standardizing the features effectively removes speaker information. Based on this observation, we propose a speaker normalization step to improve acoustic unit discovery using K-means clustering of CPC features. Finally, we show that a language model trained on the resulting units achieves some of the best results in the ZeroSpeech 2021 Challenge.
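The two operations named in the abstract, taking the per-utterance mean of the features and standardizing the features, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that an utterance is a list of frames (each frame a list of floats), that standardization is applied per utterance and per dimension, and that means are compared with cosine similarity; the abstract specifies none of these details. The function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def utterance_mean(frames):
    """Per-dimension mean over all frames of one utterance."""
    return [fmean(dim) for dim in zip(*frames)]


def standardize_utterance(frames, eps=1e-8):
    """Scale each dimension to zero mean and unit variance within one utterance.

    Assumption: statistics are computed per utterance. After this step the
    utterance mean is (close to) zero, so it no longer distinguishes speakers.
    """
    dims = list(zip(*frames))
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors (an assumed scoring choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy data: two short "utterances" with two feature dimensions each.
utt_a = [[1.0, 10.0], [3.0, 14.0]]
utt_b = [[2.0, 11.0], [4.0, 15.0]]

# Speaker-verification style score: compare the unnormalized utterance means.
score = cosine_similarity(utterance_mean(utt_a), utterance_mean(utt_b))

# Speaker normalization: standardize before any downstream clustering.
utt_a_normalized = standardize_utterance(utt_a)
```

In a pipeline like the one the abstract describes, the standardized frames would then be passed to K-means to obtain discrete units; that clustering step is omitted here.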

Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper • 2021

Related benchmarks

Task	Dataset	Result
Syntactic knowledge evaluation	sBLIMP ZeroResource Challenge 2021 (dev)	Success Rate: 54	9
Voice Conversion	LJSpeech target speaker	WER: 4.13	7
Voice Conversion	Elliot Miller target speaker	WER: 5.16	7
Zero-shot Speech Evaluation	sWUGGY	sWUGGY In-Vocab Score: 72.3	7
Lexical knowledge evaluation	sWUGGY ZeroResource Challenge 2021 (dev)	Success Rate (All): 64.3	7
Zero-shot Speech Evaluation	sBLIMP	sBLIMP Score: 54	7
