TSELM: Target Speaker Extraction using Discrete Tokens and Language Models
About
We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent speech quality and comparable speech intelligibility.
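The key idea of the cross-entropy formulation is that, once the audio is represented as discrete tokens, training reduces to per-step classification over the token vocabulary rather than waveform regression. The sketch below illustrates this with NumPy; all sizes (codebook size, sequence length) are hypothetical placeholders, not the paper's actual configuration, and the logits here are random stand-ins for the language model's output.

```python
import numpy as np

# Hypothetical sizes for illustration only; the paper's setup may differ.
VOCAB = 1024   # discrete-token codebook size
T = 50         # length of the output token sequence

rng = np.random.default_rng(0)
logits = rng.normal(size=(T, VOCAB))        # stand-in for LM output per step
targets = rng.integers(0, VOCAB, size=T)    # ground-truth token ids

# Cross-entropy = negative log-likelihood of the target token under the
# softmax distribution predicted at each time step.
shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(T), targets].mean()
print(loss)
```

Minimizing this loss pushes probability mass toward the correct token at each step; at inference the predicted tokens are then passed to the HiFi-GAN vocoder to reconstruct the waveform.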
Beilong Tang, Bang Zeng, Ming Li • 2024
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 6.4 | 1156 |
| Automatic Speech Recognition | LibriSpeech clean Speech Noise - Additive (test) | WER | 8.7 | 28 |
| Automatic Speech Recognition | LibriSpeech other Speech Noise - Additive (test) | WER | 17.7 | 28 |
| Automatic Speech Recognition | LibriSpeech other Speech Noise - Reverb (test) | WER | 49.5 | 28 |
| Automatic Speech Recognition | LibriSpeech clean Speech Noise - Reverb (test) | WER | 36.5 | 28 |
| Automatic Speech Recognition | LibriSpeech Clean other (test) | WER | 12.5 | 28 |
| Target Speaker Extraction | Libri2Mix Clean | DNSMOS OVL | 3.212 | 14 |
| Target Speaker Extraction | Libri2Mix Clean (test) | DNSMOS SIG | 3.478 | 9 |
| Target Speaker Extraction | Libri2Mix Single Speaker (test) | WER | 8.9 | 5 |