Wavesplit: End-to-End Speech Separation by Speaker Clustering
About
We introduce Wavesplit, an end-to-end source separation system. From a single mixture, the model infers a representation for each source and then estimates each source signal given the inferred representations. The model is trained to jointly perform both tasks from the raw waveform. Wavesplit infers a set of source representations via clustering, which addresses the fundamental permutation problem of separation. For speech separation, our sequence-wide speaker representations provide a more robust separation of long, challenging recordings compared to prior work. Wavesplit redefines the state of the art on clean mixtures of 2 or 3 speakers (WSJ0-2/3mix), as well as in noisy and reverberated settings (WHAM!/WHAMR!). We also set a new benchmark on the recent LibriMix dataset. Finally, we show that Wavesplit is also applicable to other domains, by separating fetal and maternal heart rates from a single abdominal electrocardiogram.
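To make the permutation problem mentioned above concrete: a separator emits its outputs in an arbitrary order, so an output channel cannot be compared to a reference source until the two sets are matched. The sketch below illustrates that ambiguity with a brute-force search over orderings. It is not Wavesplit's clustering approach. The function names `neg_mse` and `best_permutation` and the toy signals are assumptions made for illustration only.

```python
from itertools import permutations


def neg_mse(estimate, reference):
    """Negative mean squared error between two equal-length signals (higher is better)."""
    return -sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


def best_permutation(estimates, references, score=neg_mse):
    """Try every assignment of estimates to references and keep the best one.

    Returns (order, mean_score), where order[i] is the index of the estimate
    matched to references[i].
    """
    best_order, best_total = None, float("-inf")
    for order in permutations(range(len(estimates))):
        total = sum(score(estimates[j], references[i]) for i, j in enumerate(order))
        if total > best_total:
            best_order, best_total = order, total
    return best_order, best_total / len(references)


# Toy example: two reference sources and two estimates that come out swapped.
references = [[1.0, 0.0, -1.0, 0.0], [0.5, 0.5, -0.5, -0.5]]
estimates = [[0.5, 0.4, -0.5, -0.6], [0.9, 0.1, -1.0, 0.0]]

order, mean_score = best_permutation(estimates, references)
print("estimate index matched to each reference:", order)
print("mean score under the best matching:", mean_score)
```

The brute-force search costs a factorial number of comparisons in the number of sources, which is manageable for the 2 or 3 speakers considered here but grows quickly beyond that.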
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speech Separation | WSJ0-2Mix (test) | SDRi (dB) | 22.3 | 141 |
| Speech Separation | WSJ0-2Mix | SI-SNRi (dB) | 22.2 | 65 |
| Speech Separation | WHAM! (test) | SI-SNRi (dB) | 16 | 58 |
| Speech Separation | WHAMR! (test) | ΔSI-SNR | 13.2 | 57 |
| Speech Separation | Libri2Mix (test) | SI-SNRi (dB) | 19.5 | 45 |
| Speech Separation | WSJ0-3mix (test) | SI-SNRi | 17.8 | 29 |
| Speech Separation | WHAMR! | SI-SNRi | 13.2 | 20 |
| Source Separation | WSJ0-2Mix (test) | SI-SNRi | 22.2 | 17 |
| Speech Separation | WHAM! | SI-SNRi (dB) | 16 | 15 |
| Speaker Separation | WSJ0-2mix 8kHz (test) | ΔSDR | 22.3 | 14 |
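
Most rows above report SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. The sketch below follows the commonly used definition (zero-mean signals, projection of the estimate onto the reference). It is an illustration only, not the evaluation code behind these numbers. The function names, the `eps` stabilizer, and the toy sinusoids are assumptions.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]   # remove DC offset
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the "target" component.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy example: one target sinusoid plus an interfering sinusoid.
reference = [math.sin(0.3 * t) for t in range(64)]
interference = [math.sin(1.1 * t + 0.5) for t in range(64)]
mixture = [r + i for r, i in zip(reference, interference)]
# A hypothetical separator output that keeps 10% of the interference.
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]

print("SI-SNRi (dB):", si_snr_improvement(estimate, mixture, reference))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is what distinguishes SI-SNR from a plain SNR.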