Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks
About
The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers are overlapped. While speech overlaps have been regarded as a major obstacle in accurately transcribing meetings, a traditional beamformer with a single output has been exclusively used because previously proposed speech separation techniques have critical constraints for application to real meetings. This paper proposes a new signal processing module, called an unmixing transducer, and describes its implementation using a windowed BLSTM. The unmixing transducer has a fixed number, say J, of output channels, where J may be different from the number of meeting attendees, and transforms an input multi-channel acoustic signal into J time-synchronous audio streams. Each utterance in the meeting is separated and emitted from one of the output channels. Then, each output signal can be simply fed to a speech recognition back-end for segmentation and transcription. Our meeting transcription system using the unmixing transducer outperforms a system based on a state-of-the-art neural mask-based beamformer by 10.8%. Significant improvements are observed in overlapped segments. To the best of our knowledge, this is the first report that applies overlapped speech recognition to unconstrained real meeting audio.
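The data flow described above (a multi-channel input mapped window by window to a fixed number J of time-synchronous output streams) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `unmix_windowed`, `dummy_separator`, the window and hop sizes, and the overlap-averaging used to stitch windows are not taken from the paper. The placeholder separator stands in for the windowed BLSTM, and the sketch ignores the problem of keeping output-channel assignments consistent across windows.

```python
import numpy as np


def dummy_separator(segment):
    """Hypothetical stand-in for the neural separator (J = 2 outputs).

    segment: array of shape (num_mics, num_samples).
    Returns an array of shape (2, num_samples): the microphone average
    on output 0 and silence on output 1.
    """
    return np.stack([segment.mean(axis=0), np.zeros(segment.shape[1])])


def unmix_windowed(x, separate_fn, num_outputs, window, hop):
    """Apply separate_fn over sliding windows and stitch the results.

    x: multi-channel input of shape (num_mics, num_samples).
    Returns an array of shape (num_outputs, num_samples), i.e. a fixed
    number of time-synchronous output streams.
    Overlapping regions are averaged (an assumed stitching rule).
    """
    assert 0 < hop <= window, "hop must be positive and not exceed window"
    num_samples = x.shape[1]
    out = np.zeros((num_outputs, num_samples))
    weight = np.zeros(num_samples)

    start = 0
    while True:
        end = min(start + window, num_samples)
        # Separate the current window into num_outputs streams.
        out[:, start:end] += separate_fn(x[:, start:end])
        weight[start:end] += 1.0
        if end == num_samples:
            break
        start += hop

    # Average the contributions where windows overlapped.
    return out / np.maximum(weight, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mics = rng.standard_normal((7, 16000))  # 7 microphones, 1 s at 16 kHz
    streams = unmix_windowed(mics, dummy_separator, num_outputs=2,
                             window=4000, hop=2000)
    print(streams.shape)
```

In a real system each of the J output streams would then be passed to the speech recognition back-end independently, as the abstract describes.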
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Continuous speech separation | LibriCSS 0L | WER (Hybrid) | 8.4 | 13 |
| Continuous speech separation | LibriCSS 0S | WER (Hybrid) | 0.114 | 13 |
| Continuous speech separation | LibriCSS 10% | WER (Hybrid) | 13.1 | 13 |
| Continuous speech separation | LibriCSS 20% | WER (Hybrid) | 14.9 | 13 |
| Continuous speech separation | LibriCSS 30% | WER (Hybrid) | 0.187 | 13 |
| Continuous speech separation | LibriCSS 40% | WER (Hybrid) | 20.5 | 13 |
| Continuous speech separation | Real Conversation dataset | WERR | -6.4 | 8 |
| Speech Separation | LibriCSS Utterance-wise, Seven-channel (test) | Hybrid ASR WER (OS) | 7 | 6 |
| Speech Separation | LibriCSS Utterance-wise Single-channel (test) | -- | -- | 6 |