TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

About

Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.

Kohei Saijo, Gordon Wichern, Fran\c{c}ois G. Germain, Zexu Pan, Jonathan Le Roux• 2024

Related benchmarks

Task	Dataset	Result
Speech Separation	WSJ0-2Mix (test)	SDRi (dB)25.2	160
Speech Separation	WSJ0-2Mix anechoic clean mixture (test)	SI-SNRi25.1	23
Speech Separation	WHAMR!	SI-SNRi18.5	20
Speech Enhancement	DNS non-blind 2020 (test)	SI-SNR23.3	12
Speech Separation	WHAMR! 1CH	SI-SNRi (dB)18.5	11
Speech Enhancement	DNS dataset	PESQ (Wideband)3.72	9
Sound Event Separation	FSD-Kaggle 2 Sound 2018	SI-SDR14.32	7
Sound Event Separation	FSD-Kaggle 3 Sound 2018	SI-SDR9.27	7
Speech Separation	VCTK 2 Speech	SI-SDR14.52	7
Speech-Sound Event Separation	VCTK + FSD-Kaggle 1 Speech + 1 Sound 2018	SI-SDR18.22	7

Showing 10 of 12 rows

Other info

Code

Follow for update

@wizwand_team Discord