
Two-Pass End-to-End Speech Recognition

About

The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown quality comparable to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. The proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone, with only a small increase in latency over RNN-T.
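The core idea above — a streaming RNN-T first pass producing hypotheses that a LAS second pass rescores — can be sketched as score interpolation over an n-best list. The hypotheses, scores, the toy second-pass scorer, and the interpolation weight `lam` below are all illustrative assumptions, not the paper's actual models or values:

```python
# Hedged sketch of second-pass rescoring: the streaming first pass
# (e.g. RNN-T) emits an n-best list with log-probability scores, and a
# LAS-style second pass rescores each hypothesis; the final choice
# maximizes a weighted combination of the two scores.

def rescore(nbest, second_pass_score, lam=0.5):
    """Return the hypothesis maximizing an interpolation of first- and
    second-pass log scores. `nbest` is a list of (text, first_pass_score)."""
    best_text, best_score = None, float("-inf")
    for text, first_score in nbest:
        combined = (1 - lam) * first_score + lam * second_pass_score(text)
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text

# Toy stand-in for a LAS scorer: prefers the transcript containing "speech".
toy_las = lambda text: -1.0 if "speech" in text else -3.0

nbest = [("recognize peach", -2.0), ("recognize speech", -2.5)]
print(rescore(nbest, toy_las))  # prints "recognize speech"
```

Note the design point this illustrates: the first pass can stream and commit partial results with low latency, while the (non-streaming) second pass only runs over a short n-best list at utterance end, which is why the added latency is small.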

Tara N. Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirkó Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, Ian McGraw, Chung-Cheng Chiu · 2019

Related benchmarks

Task           Dataset                    Result       Rank
ASR rescoring  WSJ (test)                 WER 8.01     35
ASR rescoring  LibriSpeech (test-other)   WER 11.97    21
ASR rescoring  LibriSpeech clean (test)   WER 6.7      21
ASR rescoring  MTDialogue (test)          WER 0.0927   11
ASR rescoring  ConvAI (test)              WER 5.81     11
ASR rescoring  VoxPopuli (test)           WER 11.02    11
ASR rescoring  SLURP (test)               WER 24.91    11
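The WER figures reported for these benchmarks are conventionally computed as word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch of that computation, with toy inputs for illustration:

```python
# Minimal word error rate (WER): Levenshtein distance over word
# sequences, normalized by the reference length.

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[-1][-1] / len(r)

print(wer("the cat sat", "the cat sat on"))  # 1 insertion / 3 words ≈ 0.333
```

Reported WER is usually this ratio scaled to a percentage; whether the table's values are percentages or raw ratios (e.g. the MTDialogue entry) depends on the benchmark's own convention.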
