Deep Speech: Scaling up end-to-end speech recognition
About
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
Related benchmarks
| Task | Dataset | WER (%) | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | 21.74 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | 7.89 | 833 |
| Speech Recognition | WSJ (92-eval) | 4.94 | 131 |
| Speech Recognition | Hub5'00 SWB (test) | 12.6 | 91 |
| Speech Recognition | Hub5'00 CH (test) | 19.3 | 28 |
| Speech Recognition | WSJ 93 (test) | 6.94 | 13 |
| Speech Recognition | Hub5'00 (CallHome) | 19.3 | 11 |
| Speech Recognition | Hub5'00 Full (test) | 16 | 6 |
| Speech Recognition | Original Audio Clean 94 utterances | 6.56 | 5 |
| Speech Recognition | Original Audio Noisy | 19.06 | 5 |
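
Every result in the table is a word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows this standard definition using a word-level edit distance. It is an illustration only, not the scoring pipeline behind the numbers above. The function name `word_error_rate` and the whitespace tokenization are assumptions, and real benchmark scoring typically applies text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion (second "the")
    # against a six-word reference.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {100 * wer:.2f}%")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.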