Speech Recognition with Deep Recurrent Neural Networks
About
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates *deep recurrent neural networks*, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
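The core idea of a deep RNN is simply that the hidden-state sequence of one recurrent layer becomes the input sequence of the next, so each layer operates on a progressively higher-level representation of the signal. The sketch below is a minimal, hypothetical NumPy illustration of that stacking with standard LSTM cells; it omits the bidirectionality, CTC training, and regularisation used in the paper, and all dimensions and initialisations are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step (standard formulation, no peephole connections)."""
    H = h.shape[0]
    z = W @ x + U @ h + b          # all four gate pre-activations at once
    i = sigmoid(z[0*H:1*H])        # input gate
    f = sigmoid(z[1*H:2*H])        # forget gate
    g = np.tanh(z[2*H:3*H])        # candidate cell update
    o = sigmoid(z[3*H:4*H])        # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def deep_lstm_forward(xs, params):
    """Run a stack of LSTM layers: layer l's hidden sequence feeds layer l+1."""
    for (W, U, b) in params:
        H = U.shape[1]
        h, c = np.zeros(H), np.zeros(H)
        outs = []
        for x in xs:
            h, c = lstm_step(x, h, c, W, U, b)
            outs.append(h)
        xs = outs                  # hidden states become the next layer's input
    return np.stack(xs)            # (T, H) hidden states of the top layer

# Toy setup: 3 stacked layers over 5 frames of 13 features (e.g. MFCCs).
rng = np.random.default_rng(0)
def make_layer(in_dim, H):
    return (rng.normal(scale=0.1, size=(4*H, in_dim)),   # input weights
            rng.normal(scale=0.1, size=(4*H, H)),        # recurrent weights
            np.zeros(4*H))                               # biases

T, D, H = 5, 13, 8
params = [make_layer(D, H), make_layer(H, H), make_layer(H, H)]
xs = [rng.normal(size=D) for _ in range(T)]
top = deep_lstm_forward(xs, params)
print(top.shape)  # (5, 8): one top-layer hidden vector per input frame
```

In a full system, the top-layer hidden states would be projected to per-frame phoneme distributions and trained with the CTC loss, which marginalises over all alignments between the frame sequence and the label sequence.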
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Text Recognition | IIIT5K | Accuracy | 64.1 | 149 |
| Speech Recognition | WSJ (92-eval) | WER | 22.7 | 131 |
| Text Recognition | Street View Text (SVT) | Accuracy | 73.2 | 80 |
| Scene Text Recognition | IC03 | Accuracy | 81.8 | 67 |
| Scene Text Recognition | SVT-Perspective (test) | Accuracy | 45.7 | 56 |
| Phoneme Recognition | TIMIT (test) | PER | 17.7 | 31 |
| Phone recognition | TIMIT (test) | Frame Error Rate | 17.7 | 23 |
| Phoneme Recognition | TIMIT core (test) | PER | 17.7 | 20 |
| Online Speech Recognition | TIMIT (test) | PER | 0.196 | 6 |