
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

About

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full-attention version of our model beats the state-of-the-art accuracy on the LibriSpeech benchmarks. Our results also show that we can bridge the gap between the full-attention and limited-attention versions of our model by attending to a limited number of future frames.
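
As a rough illustration of the limited-context self-attention mentioned in the abstract, the sketch below builds a boolean mask in which each audio frame may attend only to a fixed number of past frames and a small number of future frames. The function name `limited_context_mask`, the parameters `left` and `right`, and the window sizes used here are assumptions chosen for illustration; they are not taken from the paper.

```python
def limited_context_mask(num_frames, left, right):
    """Return mask[t][s] == True when frame t may attend to frame s.

    Hypothetical helper: `left` is the number of past frames visible to
    each frame, `right` the number of future frames (look-ahead).
    """
    return [
        [t - left <= s <= t + right for s in range(num_frames)]
        for t in range(num_frames)
    ]


# Illustrative sizes only: 5 frames, 2 frames of left context, 1 of look-ahead.
mask = limited_context_mask(num_frames=5, left=2, right=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions each frame attends to at most `left + right + 1` frames, so the per-frame attention cost stays bounded regardless of utterance length, which is the property that makes streaming decoding tractable. Setting `right` to 0 gives a strictly left-context model; a small positive `right` corresponds to attending to a limited number of future frames.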

Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar • 2020

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 2 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 4.6 | 1151 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER 5.28 | 462 |
| Automatic Speech Recognition | LibriSpeech 960h (test-other) | WER 4.6 | 88 |
| Speech Recognition | LibriSpeech clean (dev) | WER 0.0216 | 80 |
| Speech Recognition | LibriSpeech (test) | -- | 76 |
| Automatic Speech Recognition | LibriSpeech 960h (test-clean) | WER 0.02 | 60 |
