Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
About
In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from the audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full-attention version of our model beats state-of-the-art accuracy on the LibriSpeech benchmarks. Our results further show that we can bridge the gap between the full-attention and limited-attention versions of our model by attending to a limited number of future frames.
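Below is a minimal NumPy sketch (not the authors' implementation; all shapes, parameter names, and the `joint_network` / `limited_context_mask` helpers are illustrative assumptions) of the two ideas described above: a joint feed-forward layer that combines audio-encoder and label-encoder activations into a distribution over labels for every (frame position, label history) pair, and a self-attention mask that restricts each frame to a limited left context, optionally plus a few future frames, which is what makes streaming decoding tractable.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_network(audio_enc, label_enc, W_a, W_l, b, W_out, b_out):
    """Combine encoder outputs into P(label | frame t, label history u).

    audio_enc: (T, D_a)  -- Transformer audio-encoder activations per frame
    label_enc: (U, D_l)  -- Transformer label-encoder activations per history
    Returns:   (T, U, V) -- distribution over the V-way label space (incl. blank)
    """
    # Project both encoders into a shared joint dimension and add them,
    # giving one vector for every (t, u) combination.
    a = audio_enc @ W_a                                   # (T, D_j)
    l = label_enc @ W_l                                   # (U, D_j)
    joint = np.tanh(a[:, None, :] + l[None, :, :] + b)    # (T, U, D_j)
    logits = joint @ W_out + b_out                        # (T, U, V)
    return softmax(logits)


def limited_context_mask(T, left_context, right_context=0):
    """Boolean (T, T) attention mask: frame t may attend to frames in
    [t - left_context, t + right_context]. right_context=0 keeps the
    encoder fully streamable; allowing a few future frames narrows the
    gap to full attention at the cost of extra latency."""
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]   # key index minus query index
    return (rel >= -left_context) & (rel <= right_context)


# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
T, U, D_a, D_l, D_j, V = 6, 4, 8, 8, 16, 10
probs = joint_network(
    rng.normal(size=(T, D_a)), rng.normal(size=(U, D_l)),
    rng.normal(size=(D_a, D_j)), rng.normal(size=(D_l, D_j)),
    np.zeros(D_j), rng.normal(size=(D_j, V)), np.zeros(V),
)
assert probs.shape == (T, U, V) and np.allclose(probs.sum(-1), 1.0)
print(limited_context_mask(T, left_context=2, right_context=1).astype(int))
```

In an actual model, the (T, U, V) output grid is what the RNN-T loss marginalizes over during training, and the same mask shape would be applied inside each Transformer self-attention layer of the audio encoder.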
Related benchmarks
| Task | Dataset | WER (%) | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | 4.6 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | 2.0 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | 5.28 | 411 |
| Automatic Speech Recognition | LibriSpeech 960h (test-other) | 4.6 | 81 |
| Speech Recognition | LibriSpeech clean (dev) | 2.16 | 59 |
| Speech Recognition | LibriSpeech (test) | -- | 59 |
| Automatic Speech Recognition | LibriSpeech 960h (test-clean) | 2.0 | 53 |