
Improved training of end-to-end attention models for speech recognition

About

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme that starts with a high time reduction factor and lowers it during training, which is crucial both for convergence and for final performance. In some experiments, we also use an auxiliary CTC loss function to help convergence. In addition, we train long short-term memory (LSTM) language models on subword units. With shallow fusion, we report up to 27% relative improvement in WER over the attention baseline without a language model.

Albert Zeyer, Kazuki Irie, Ralf Schlüter, Hermann Ney · 2018

Related benchmarks

Task                           Dataset                          WER (%)   Rank
Automatic Speech Recognition   LibriSpeech (test-other)         12.76     966
Automatic Speech Recognition   LibriSpeech clean (test)         3.8       833
Automatic Speech Recognition   LibriSpeech (dev-other)          11.5      411
Automatic Speech Recognition   LibriSpeech (dev-clean)          3.5       319
Speech Recognition             Hub5'00 SWB (test)               8.3       91
Speech Recognition             Hub5'00 CH (test)                25.7      28
Speech Recognition             LibriSpeech clean 1000h (test)   3.82      9
Automatic Speech Recognition   Hub5'01 (test)                   --        8

Other info

Code
