Self-Training for End-to-End Speech Recognition
About
We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, filtering mechanisms tailored to common errors from sequence-to-sequence models, and a novel ensemble approach to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that with an ensemble of four models and label filtering, self-training yields a 33.9% relative improvement in WER over a baseline trained on 100 hours of labelled data in the noisy speech setting. In the clean speech setting, self-training recovers 59.3% of the gap between the baseline and an oracle model, at least 93.8% more (in relative terms) than previous approaches recover.
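The pseudo-label filtering step can be illustrated with a minimal sketch. The specific heuristics below (consecutive n-gram repetition and a minimum length check, which target looping and early-stopping decoder failures) are illustrative assumptions, not the paper's exact filtering rules:

```python
# Sketch of pseudo-label filtering for self-training. The thresholds and
# heuristics here are assumptions for illustration, not the paper's rules.

def has_ngram_loop(tokens, n=3, max_repeats=2):
    """Flag hypotheses in which some n-gram occurs more than max_repeats
    times, a common looping failure of sequence-to-sequence decoders."""
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False

def filter_pseudo_labels(hypotheses, min_len=3):
    """Keep only (utterance_id, transcript) pairs that pass sanity filters."""
    kept = []
    for utt_id, text in hypotheses:
        tokens = text.split()
        if len(tokens) < min_len:
            continue  # likely truncated by an early end-of-sentence token
        if has_ngram_loop(tokens):
            continue  # likely a decoder loop
        kept.append((utt_id, text))
    return kept

# Hypothetical decoder outputs on unlabelled audio.
hyps = [
    ("utt1", "the cat sat on the mat"),
    ("utt2", "so so so so so so so so so"),  # looping output, filtered
    ("utt3", "yes"),                          # too short, filtered
]
print(filter_pseudo_labels(hyps))  # → [('utt1', 'the cat sat on the mat')]
```

The surviving pseudo-labelled utterances would then be mixed with the 100 hours of labelled data to retrain the acoustic model; the ensemble variant repeats this with several models to diversify the pseudo-labels.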
Related benchmarks
| Task | Dataset | WER (%) | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | 20.11 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | 5.93 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | 18.95 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | 5.37 | 319 |
| Automatic Speech Recognition | LibriSpeech 100h (test-clean) | 5.79 | 32 |
| Automatic Speech Recognition | LibriSpeech 100h clean (dev) | 5.41 | 20 |