ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
About
Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet and achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system, which achieves 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
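The abstract's central idea, injecting global context into convolution layers with a squeeze-and-excitation module, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `squeeze_excite_1d`, the ReLU bottleneck, the layer sizes, and the random weights are all assumptions chosen for illustration. It only shows the general mechanism of averaging over time to form one context vector, then using that vector to gate every frame channel by channel.

```python
import numpy as np

def squeeze_excite_1d(x, w1, b1, w2, b2):
    """Illustrative squeeze-and-excitation over a (time, channels) feature map.

    All parameter names and shapes are hypothetical:
    w1: (channels, bottleneck), w2: (bottleneck, channels).
    """
    # Squeeze: average over the time axis to get one global context vector.
    context = x.mean(axis=0)                          # (channels,)
    # Excitation: small bottleneck network (ReLU here is an assumption).
    hidden = np.maximum(0.0, context @ w1 + b1)       # (bottleneck,)
    # Sigmoid gate in (0, 1), one value per channel.
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))  # (channels,)
    # Broadcast the same gate to every frame, so each frame sees global context.
    return x * gate

rng = np.random.default_rng(0)
T, C, R = 50, 16, 4            # frames, channels, bottleneck size (arbitrary)
x = rng.standard_normal((T, C))
w1 = rng.standard_normal((C, R)) * 0.1
b1 = np.zeros(R)
w2 = rng.standard_normal((R, C)) * 0.1
b2 = np.zeros(C)

y = squeeze_excite_1d(x, w1, b1, w2, b2)
print(y.shape)  # (50, 16)
```

Because the gate is computed from the whole utterance, a convolution with a limited receptive field still receives a signal that depends on every frame, which is the sense in which the abstract speaks of global context.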
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 4.1 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 1.9 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER 3.9 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) 1.9 | 319 |
| Automatic Speech Recognition | LibriSpeech 960h (test-other) | WER 4.1 | 81 |
| Speech Recognition | LibriSpeech clean (dev) | WER 0.02 | 59 |
| Speech Recognition | LibriSpeech (test) | -- | 59 |
| Automatic Speech Recognition | LibriSpeech 960h (test-clean) | WER 0.019 | 53 |
| Speech Recognition | YouTube (test) | WER 8.2 | 10 |
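
Every result in the table is a word error rate, so a short sketch of how WER is commonly computed may help in reading it. The function below is a generic word-level edit-distance implementation, not the scoring tool used for these benchmarks, and the name `word_error_rate` is hypothetical. Note that the table appears to mix units: most rows give WER in percent (for example 1.9), while a few appear to give it as a fraction (0.019, which would correspond to 1.9%).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # reference prefix of length i vs. empty hypothesis: i deletions
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion: reference word missing
                cur[j - 1] + 1,          # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{100 * wer:.1f}%")  # 16.7%
```

Benchmark scores are usually reported at corpus level, summing the errors over all utterances and dividing by the total number of reference words, rather than averaging per-utterance rates.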