Many-Speakers Single Channel Speech Separation with Optimal Permutation Training
About
Single channel speech separation has experienced great progress in the last few years. However, training neural speech separation models for a large number of speakers (e.g., more than 10 speakers) is out of reach for current methods, which rely on the Permutation Invariant Training (PIT) loss. In this work, we present a permutation-invariant training scheme that employs the Hungarian algorithm in order to train with $O(C^3)$ time complexity, where $C$ is the number of speakers, in comparison to the $O(C!)$ of PIT-based methods. Furthermore, we present a modified architecture that can handle the increased number of speakers. Our approach separates up to $20$ speakers and improves the previous results for large $C$ by a wide margin.
Shaked Dovrat, Eliya Nachmani, Lior Wolf • 2021
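To make the complexity gap in the abstract concrete, the sketch below compares the two ways of picking the best estimate-to-target assignment: exhaustive PIT-style search over all $C!$ permutations, and the Hungarian algorithm on a $C \times C$ pairwise cost matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names (`pairwise_mse`, `brute_force`, `hungarian`) are hypothetical, mean squared error stands in for whatever pairwise loss is actually used in training, and the solver is a textbook potentials-based Hungarian routine written in plain Python. For $C = 20$, exhaustive search would have to score $20! \approx 2.4 \times 10^{18}$ permutations, while $C^3 = 8000$.

```python
from itertools import permutations


def pairwise_mse(estimates, targets):
    """cost[i][j] = mean squared error between estimate i and target j.

    MSE is an assumed stand-in for the pairwise separation loss.
    """
    return [
        [sum((e - t) ** 2 for e, t in zip(est, tgt)) / len(est) for tgt in targets]
        for est in estimates
    ]


def brute_force(cost):
    """PIT-style exhaustive search: score every permutation, O(C!)."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return sum(cost[i][best[i]] for i in range(n)), list(best)


def hungarian(cost):
    """Minimum-cost assignment on a square matrix in O(C^3).

    Returns (total_cost, assignment) with assignment[i] = column matched to row i.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (n + 1)   # column potentials (1-indexed)
    p = [0] * (n + 1)     # p[j] = row currently matched to column j, 0 = free
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so at least one more column becomes reachable.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        while True:         # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return sum(cost[i][assignment[i]] for i in range(n)), assignment
```

Both functions return the same minimum total cost on any square cost matrix; only the running time differs. In practice, `scipy.optimize.linear_sum_assignment` solves the same assignment problem and would be the more usual choice than a hand-written solver.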
Related benchmarks
| Task | Dataset | SI-SDRi (dB) | Rank |
|---|---|---|---|
| Speech Separation | Libri-5Mix | 12.72 | 9 |
| Speech Separation | Libri-10Mix | 7.78 | 9 |
| Audio Separation | Libri5Mix (test) | 13.5 | 6 |
| Speech Separation | WSJ 5mix | 13.22 | 5 |
| Speech Separation | Libri-15Mix | 5.66 | 1 |
| Speech Separation | LibriMix 20Mix | 4.26 | 1 |