Speaker anonymization using neural audio codec language models
About
Most approaches to speaker anonymization extract fundamental frequency estimates, linguistic features and a speaker embedding; the embedding is perturbed to obfuscate the speaker identity, and an anonymized speech waveform is then resynthesized using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within the fundamental frequency and linguistic features are re-entangled upon vocoding, so the anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information. We demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech Clean other (test) | WER | 41 | 34 |
| Speech Emotion Recognition | IEMOCAP | Weighted Accuracy (WA) | 65.57 | 6 |
| Voice Anonymization | LibriSpeech clean (test) | EER | 41.88 | 4 |
| Voice Anonymization | LibriSpeech other (test) | EER | 37.88 | 4 |
| Voice Anonymization | LibriTTS clean (test) | EER | 43.06 | 4 |
| Voice Anonymization | LibriTTS other (test) | EER | 43.18 | 4 |
| Voice Anonymization | IEMOCAP | EER | 53 | 4 |
| Voice Anonymization | NVIDIA RTX 3090 GPU | RTF | 1.62 | 3 |
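
The voice anonymization rows report the equal error rate (EER) of a speaker verification attacker: the operating point at which the false acceptance rate equals the false rejection rate. For anonymization, a higher EER means the attacker is less able to link anonymized speech to its speaker. The following is a minimal sketch of how an EER can be computed from two lists of verification scores. It is not the Voice Privacy Challenge evaluation code; the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER as a fraction in [0, 1].

    Assumes higher scores mean "more likely the same speaker".
    Hypothetical helper for illustration only.
    """
    # Candidate thresholds: every distinct score that was observed.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): a perfectly separable attacker.
print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 0.0
```

With identical score distributions for same-speaker and different-speaker trials, the same function returns 0.5, which corresponds to an attacker operating at chance.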