Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks

About

The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers are overlapped. While speech overlaps have been regarded as a major obstacle in accurately transcribing meetings, a traditional beamformer with a single output has been exclusively used because previously proposed speech separation techniques have critical constraints for application to real meetings. This paper proposes a new signal processing module, called an unmixing transducer, and describes its implementation using a windowed BLSTM. The unmixing transducer has a fixed number, say J, of output channels, where J may be different from the number of meeting attendees, and transforms an input multi-channel acoustic signal into J time-synchronous audio streams. Each utterance in the meeting is separated and emitted from one of the output channels. Then, each output signal can be simply fed to a speech recognition back-end for segmentation and transcription. Our meeting transcription system using the unmixing transducer outperforms a system based on a state-of-the-art neural mask-based beamformer by 10.8%. Significant improvements are observed in overlapped segments. To the best of our knowledge, this is the first report that applies overlapped speech recognition to unconstrained real meeting audio.

Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao, Fil Alleva • 2018
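
The input/output contract described in the abstract can be illustrated with a minimal sketch. It only shows the shapes involved: a multi-channel recording goes in, a fixed number J of time-synchronous streams comes out, and each stream is passed to a recognizer independently. The function names `unmix_placeholder` and `transcribe_streams` and the `recognize` callable are hypothetical, and the placeholder body (averaging the microphones into the first output and leaving the others silent) is an assumption made for illustration. It is not the windowed BLSTM from the paper.

```python
from typing import Callable, List, Sequence

Signal = List[float]


def unmix_placeholder(mics: Sequence[Sequence[float]], num_outputs: int) -> List[Signal]:
    """Hypothetical stand-in for the unmixing transducer.

    Takes one sample sequence per microphone and returns exactly
    `num_outputs` streams of the same length as the input, regardless
    of how many people are speaking.
    """
    if num_outputs < 1:
        raise ValueError("num_outputs (J) must be at least 1")
    num_samples = len(mics[0])
    # Placeholder behavior only: average all microphones into output 0.
    averaged = [sum(frame) / len(mics) for frame in zip(*mics)]
    # Remaining outputs are silent in this illustration.
    silent = [[0.0] * num_samples for _ in range(num_outputs - 1)]
    return [averaged] + silent


def transcribe_streams(streams: Sequence[Signal],
                       recognize: Callable[[Signal], str]) -> List[str]:
    """Feed each output stream to a recognizer independently."""
    return [recognize(stream) for stream in streams]
```

Because every output stream has the same length as the input, downstream segmentation and transcription can treat each one as an ordinary single-speaker recording.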

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Continuous speech separation | LibriCSS 0L | WER (Hybrid) 8.4 | 13 |
| Continuous speech separation | LibriCSS 0S | WER (Hybrid) 11.4 | 13 |
| Continuous speech separation | LibriCSS 10% | WER (Hybrid) 13.1 | 13 |
| Continuous speech separation | LibriCSS 20% | WER (Hybrid) 14.9 | 13 |
| Continuous speech separation | LibriCSS 30% | WER (Hybrid) 18.7 | 13 |
| Continuous speech separation | LibriCSS 40% | WER (Hybrid) 20.5 | 13 |
| Continuous speech separation | Real Conversation dataset | WERR -6.4 | 8 |
| Speech Separation | LibriCSS Utterance-wise, Seven-channel (test) | Hybrid ASR WER (OS) 7 | 6 |
| Speech Separation | LibriCSS Utterance-wise Single-channel (test) | -- | 6 |
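
The results above are reported as word error rate (WER). As a reference for reading them, here is a minimal sketch of the standard WER definition: word-level edit distance divided by the number of reference words. The second helper computes a relative WER reduction. Reading the page's "WERR" as that quantity is an assumption, since the page does not define the abbreviation or its sign convention. Both function names are illustrative, and the sketch is not the scoring tool used for these benchmarks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed meaning of WERR)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, and `relative_wer_reduction(20.0, 15.0)` returns `0.25`.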