
Attention is All You Need in Speech Separation

About

Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short- and long-term dependencies with a multi-scale approach that employs transformers. The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. It reaches an SI-SNRi of 22.3 dB on WSJ0-2mix and an SI-SNRi of 19.5 dB on WSJ0-3mix. The SepFormer inherits the parallelization advantages of Transformers and achieves competitive performance even when downsampling the encoded representation by a factor of 8. It is thus significantly faster and less memory-demanding than the latest speech separation systems with comparable performance.
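The multi-scale idea in the abstract can be illustrated with a toy dual-path sketch: the encoded sequence is split into chunks, attention is applied within each chunk (short-term dependencies) and then across chunks at each position (long-term dependencies). This is a minimal NumPy sketch under stated assumptions only; `dual_path_block` and `chunk_size` are hypothetical names, and the real SepFormer uses learned multi-head attention blocks with feed-forward layers and positional encodings, all of which this weight-free toy omits.

```python
import numpy as np

def attention(x):
    # Single-head scaled dot-product self-attention on x of shape (n, d).
    # Learned query/key/value projections are omitted for brevity.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def dual_path_block(x, chunk_size):
    # x: (T, F) encoded sequence, split into (n_chunks, chunk_size, F).
    T, F = x.shape
    n_chunks = T // chunk_size
    chunks = x[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, F)
    # Intra-chunk attention: short-term dependencies within each chunk.
    intra = np.stack([attention(c) for c in chunks])
    # Inter-chunk attention: long-term dependencies across chunks,
    # attending over the same position index in every chunk.
    inter = np.stack([attention(intra[:, t]) for t in range(chunk_size)], axis=1)
    return inter.reshape(-1, F)

x = np.random.randn(32, 8)   # toy encoded representation
y = dual_path_block(x, chunk_size=8)
print(y.shape)  # same (T, F) shape as the input
```

Because the inter-chunk pass attends over only `T / chunk_size` positions, attention cost stays small even for long sequences, which is what makes the factor-8 downsampling mentioned above affordable.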

Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong• 2020

Related benchmarks

Task               Dataset                                    Metric         Result  Rank
Speech Separation  WSJ0-2Mix (test)                           SDRi (dB)      22.4    160
Speech Separation  WSJ0-2Mix                                  SI-SNRi (dB)   22.3    65
Speech Separation  Libri2Mix (test)                           SI-SNRi (dB)   19.2    60
Speech Separation  WHAM! (test)                               SI-SNRi (dB)   16.4    58
Speech Separation  WHAMR! (test)                              ΔSI-SNR (dB)   14      57
Speech Separation  WSJ0-3mix (test)                           --             --      29
Speech Separation  WSJ0-2Mix anechoic clean mixture (test)    SI-SNRi (dB)   20.4    23
Speech Separation  WHAMR!                                     SI-SNRi (dB)   14      20
Speech Separation  WHAM! 2-speaker (test)                     SI-SNRi (dB)   14.7    18
Speech Separation  WSJ0-3mix                                  SI-SNRi (dB)   17.6    17
Showing 10 of 24 rows
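Most rows above report SI-SNRi, the improvement in scale-invariant SNR of the separated estimate over the unprocessed mixture. A minimal NumPy implementation of the standard formulation (scale-invariant projection of the estimate onto the reference, as in Le Roux et al.'s "SDR – half-baked or well done?") looks like this; the function names are mine, not from the paper's code:

```python
import numpy as np

def si_snr(est, ref):
    # Scale-invariant SNR in dB: project est onto ref, then compare the
    # target component against the residual noise component.
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / np.dot(e_noise, e_noise))

def si_snr_i(est, mix, ref):
    # SI-SNRi: how much the separator improved over the raw mixture.
    return si_snr(est, ref) - si_snr(mix, ref)
```

A separated estimate that removes most of the interference yields a positive SI-SNRi; the table entries (e.g. 22.3 dB on WSJ0-2Mix) are this quantity averaged over the test set.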

Other info

Code
