A Better and Faster End-to-End Model for Streaming ASR
About
End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the E2E model still tends to delay its predictions, and thus has much higher partial latency than a conventional ASR model. To address this issue, we encourage the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving latency results in some quality degradation. To recover quality, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality further. To ensure the 2nd pass completes quickly, we use non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, we find that the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.
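The data flow of the cascaded-encoder idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy causal and non-causal "encoders" are simple moving averages standing in for LSTM/Conformer stacks, and the shared decoder is reduced to a linear projection (the real model uses an RNN-T prediction network and joint network). All names and dimensions here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_encoder(x):
    """Streaming (causal) encoder stand-in: output frame t may only
    depend on input frames <= t, so partial results can be emitted
    as audio arrives."""
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = x[max(0, t - 2): t + 1].mean(axis=0)
    return out

def non_causal_encoder(h):
    """Non-causal (cascaded) encoder stand-in, stacked on the causal
    features: each frame may also look a few frames into the future,
    trading latency for quality in the 2nd pass."""
    out = np.zeros_like(h)
    T = h.shape[0]
    for t in range(T):
        out[t] = h[max(0, t - 2): min(T, t + 3)].mean(axis=0)
    return out

def shared_decoder(h, proj):
    """Shared decoder stand-in: the same weights score both the
    causal (1st-pass) and non-causal (2nd-pass) encoder outputs."""
    return h @ proj

T, D, V = 50, 16, 8          # frames, feature dim, vocab size (illustrative)
x = rng.normal(size=(T, D))  # acoustic features
proj = rng.normal(size=(D, V))

h_causal = causal_encoder(x)                   # 1st pass: low-latency streaming
logits_stream = shared_decoder(h_causal, proj) # partial hypotheses

h_full = non_causal_encoder(h_causal)          # 2nd pass: adds right context
logits_final = shared_decoder(h_full, proj)    # final hypothesis, same decoder
```

The key property is that the 2nd pass reuses the 1st-pass causal features and decoder, so only the thin non-causal stack runs after the utterance ends, keeping the final result fast.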
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 2.6 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.4 | 833 |
| Speech Recognition | WSJ (92-eval) | WER | 1.3 | 131 |
| Automatic Speech Recognition | SWITCHBOARD swbd | WER | 4.3 | 39 |
| Automatic Speech Recognition | TED-LIUM (test) | WER | 5.2 | 19 |
| Automatic Speech Recognition | AMI IHM | WER | 9 | 10 |
| Speech Recognition | YouTube (test) | WER | 9.1 | 10 |
| Automatic Speech Recognition | AMI SDM English (eval) | WER | 21.2 | 8 |
| Automatic Speech Recognition | Switchboard Fisher (CH) | WER | 0.068 | 6 |
| Automatic Speech Recognition | Common Voice+ (test) | WER (%) | 8.4 | 6 |