Attention is All You Need in Speech Separation
About
Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short- and long-term dependencies with a multi-scale approach that employs transformers. The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. It reaches an SI-SNRi of 22.3 dB on WSJ0-2mix and an SI-SNRi of 19.5 dB on WSJ0-3mix. The SepFormer inherits the parallelization advantages of Transformers and achieves competitive performance even when downsampling the encoded representation by a factor of 8. It is thus significantly faster and less memory-demanding than the latest speech separation systems with comparable performance.
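The multi-scale idea can be illustrated with a dual-path pass over a chunked sequence: attention applied within each chunk models short-term structure, and attention applied across chunks at each position models long-term structure. Below is a minimal NumPy sketch of that pattern; it uses single-head attention with identity projections for brevity, which is a deliberate simplification (the actual SepFormer uses stacks of multi-head transformer blocks with feed-forward layers and positional encodings).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, d). Scaled dot-product self-attention with identity
    # Q/K/V projections -- a hypothetical simplification for illustration.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def dual_path_block(x, chunk_len):
    # x: (T, d) encoded representation; T is assumed divisible by chunk_len.
    T, d = x.shape
    chunks = x.reshape(T // chunk_len, chunk_len, d)
    # Intra-chunk pass: short-term dependencies within each chunk.
    intra = np.stack([self_attention(c) for c in chunks])
    # Inter-chunk pass: long-term dependencies across chunks, one pass
    # per within-chunk position.
    inter = np.stack(
        [self_attention(intra[:, i, :]) for i in range(chunk_len)], axis=1
    )
    return inter.reshape(T, d)

x = np.random.randn(32, 8)
y = dual_path_block(x, chunk_len=8)
print(y.shape)  # (32, 8)
```

Because each attention pass only sees either one chunk or one cross-chunk slice, the quadratic cost of attention applies to much shorter sequences than the full input, which is what makes aggressive downsampling of the encoded representation viable.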
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Separation | WSJ0-2Mix (test) | SDRi (dB) | 22.4 | 141 |
| Speech Separation | WSJ0-2Mix | SI-SNRi (dB) | 22.3 | 65 |
| Speech Separation | WHAM! (test) | SI-SNRi (dB) | 16.4 | 58 |
| Speech Separation | WHAMR! (test) | ΔSI-SNR (dB) | 14 | 57 |
| Speech Separation | Libri2Mix (test) | SI-SNRi (dB) | 19.2 | 45 |
| Speech Separation | WSJ0-3mix (test) | -- | -- | 29 |
| Speech Separation | WHAMR! | SI-SNRi (dB) | 14 | 20 |
| Speech Separation | WHAM! | SI-SNRi (dB) | 16.4 | 15 |
| Voice Separation | WSJ0-3mix | SI-SNRi (dB) | 19.5 | 14 |
| Monaural Speech Separation | WSJ0-3mix | ΔSI-SDR (dB) | 17.6 | 13 |
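Most of the results above are reported as SI-SNRi, the improvement in scale-invariant signal-to-noise ratio over the unprocessed mixture. A short NumPy sketch of the standard SI-SNR definition (project the estimate onto the target, then compare that projection to the residual) makes the metric concrete; the signals below are synthetic illustrations, not data from the paper.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    # Scale-invariant SNR in dB. Zero-mean both signals, project the
    # estimate onto the target, and compare projection to residual.
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    s_target = (estimate @ target) / (target @ target + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

# SI-SNRi is the gain over treating the raw mixture as the estimate.
rng = np.random.default_rng(0)
source = np.sin(np.linspace(0, 200, 8000))
noise = 0.5 * rng.standard_normal(8000)
mixture = source + noise
separated = source + 0.05 * noise      # a hypothetical separator output
improvement = si_snr(separated, source) - si_snr(mixture, source)
print(improvement > 0)  # True: separation improved over the mixture
```

The scale invariance (dividing out the projection coefficient) means a separator is not rewarded or penalized for simply rescaling its output, which is why SI-SNRi has become the standard metric on WSJ0-2/3mix.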