Speaker anonymization using neural audio codec language models

About

The vast majority of approaches to speaker anonymization involve the extraction of fundamental frequency estimates, linguistic features and a speaker embedding which is perturbed to obfuscate the speaker identity before an anonymized speech waveform is resynthesized using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within fundamental frequency and linguistic features are re-entangled upon vocoding, meaning that anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information: we demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.
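The bottleneck effect of quantized codes mentioned above can be illustrated with a toy nearest-neighbour vector quantizer. This is only a minimal sketch, not the codec used by the authors: the function name `quantize`, the three-entry codebook and the two-dimensional frames are all hypothetical, whereas real neural audio codecs learn their codebooks and typically stack several quantizers. It shows how two slightly different inputs collapse onto the same discrete codes, so fine detail that distinguishes them is discarded.

```python
def quantize(frames, codebook):
    """Map each frame (a list of floats) to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical toy codebook; a real codec learns far larger ones from data.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Two slightly different inputs, standing in for similar content with
# different fine detail (an illustrative assumption, not real speech features).
frames_a = [[0.1, -0.1], [0.9, 1.2]]
frames_b = [[-0.2, 0.1], [1.1, 0.8]]

codes_a = quantize(frames_a, codebook)
codes_b = quantize(frames_b, codebook)
print(codes_a, codes_b)  # both inputs map to the same code sequence
```

Once the frames are replaced by code indices, anything not captured by the choice of codebook entry cannot be recovered from the codes alone, which is the sense in which quantization acts as a bottleneck.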

Michele Panariello, Francesco Nespoli, Massimiliano Todisco, Nicholas Evans • 2023

Related benchmarks

Task	Dataset	Metric	Value	
Automatic Speech Recognition	LibriSpeech Clean other (test)	WER	41	34
Speech Emotion Recognition	IEMOCAP	Weighted Accuracy (WA)	65.57	11
Voice Anonymization	LibriSpeech clean (test)	EER	41.88	4
Voice Anonymization	LibriSpeech other (test)	EER	37.88	4
Voice Anonymization	LibriTTS clean (test)	EER	43.06	4
Voice Anonymization	LibriTTS other (test)	EER	43.18	4
Voice Anonymization	IEMOCAP	EER	53	4
Voice Anonymization	NVIDIA RTX 3090 GPU	RTF	1.62	3
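
For readers unfamiliar with the metrics listed above, the following is a minimal sketch of how an equal error rate (EER) and a word error rate (WER) can be computed from scratch. It is not the Voice Privacy Challenge evaluation code: the function names `compute_eer` and `word_error_rate` and the toy inputs are assumptions made for illustration, and the EER is approximated by a simple threshold sweep. In anonymization evaluation the EER is measured on an attacker's speaker verification system, so a higher EER indicates better privacy, with 0.5 (50%) corresponding to chance.

```python
def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate as a fraction in [0, 1]."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # Deletion, insertion, and substitution (or match) respectively.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Toy examples with made-up scores and transcripts.
separable = compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
chance = compute_eer([0.1, 0.9], [0.1, 0.9])
wer = word_error_rate("a b c d", "a b")
print(separable, chance, wer)
```

With perfectly separated scores the attacker makes no errors, while fully overlapping scores leave the attacker at chance level. The page does not state the units of the table values; the EER and WER figures read most naturally as percentages.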
