Continuous Speech Separation with Conformer
About
Continuous speech separation plays a vital role in complicated speech-related tasks such as conversation transcription. The separation model extracts a single-speaker signal from mixed speech. In this paper, we use the transformer and conformer in lieu of recurrent neural networks in the separation system, as we believe capturing global information with self-attention-based methods is crucial for speech separation. Evaluated on the LibriCSS dataset, the conformer separation model achieves state-of-the-art results, with a relative 23.5% word error rate (WER) reduction from bi-directional LSTM (BLSTM) in the utterance-wise evaluation and a 15.4% WER reduction in the continuous evaluation.
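Separation models of this kind typically predict per-speaker time-frequency masks that are applied to the mixture spectrogram. The sketch below illustrates only that masking step with NumPy; the random masks stand in for the conformer network's output, and the function name `apply_separation_masks` is an illustrative assumption, not from the paper.

```python
import numpy as np

def apply_separation_masks(mixture_mag, masks):
    """Apply per-speaker masks (as a separation network would predict)
    to a mixture magnitude spectrogram.

    mixture_mag: array of shape (frames, bins)
    masks: array of shape (num_speakers, frames, bins), values in [0, 1]
    returns: (num_speakers, frames, bins) separated magnitudes
    """
    return masks * mixture_mag[None, :, :]

# Toy example: 2 speakers, 100 frames, 257 frequency bins.
rng = np.random.default_rng(0)
mixture = rng.random((100, 257))
# Placeholder masks; in the paper these would come from the conformer.
raw = rng.random((2, 100, 257))
masks = raw / raw.sum(axis=0, keepdims=True)  # masks sum to 1 per bin
separated = apply_separation_masks(mixture, masks)
print(separated.shape)  # (2, 100, 257)
# Masks summing to 1 mean the separated magnitudes add back to the mixture.
print(np.allclose(separated.sum(axis=0), mixture))  # True
```

Because the masks are normalized to sum to one at every time-frequency bin, the separated components reconstruct the mixture exactly; real systems relax this constraint and learn the masks end to end.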
Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou• 2020
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Separation | WSJ0-2Mix (test) | -- | 141 |
| Speech Separation | WHAMR! (test) | ΔSI-SNR 6.7 | 57 |
| Continuous speech separation | LibriCSS 20% | WER (Hybrid) 13.5 | 13 |
| Continuous speech separation | LibriCSS 0S | WER (Hybrid) 0.11 | 13 |
| Continuous speech separation | LibriCSS 0L | WER (Hybrid) 8.7 | 13 |
| Continuous speech separation | LibriCSS 10% | WER (Hybrid) 12.6 | 13 |
| Continuous speech separation | LibriCSS 30% | WER (Hybrid) 0.175 | 13 |
| Continuous speech separation | LibriCSS 40% | WER (Hybrid) 19.6 | 13 |
| Continuous speech separation | Real Conversation dataset | WERR -7.2 | 8 |
| Speech Separation | LibriCSS Utterance-wise, Seven-channel (test) | Hybrid ASR WER (0S) 7.2 | 6 |