SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
About
We present SpeechStew, a speech recognition model trained on a combination of publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing. SpeechStew achieves SoTA or near-SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, significantly outperforming prior work that uses strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. Fine-tuning SpeechStew on CHiME-6, a noisy low-resource speech dataset, achieves 38.9% WER without a language model, compared to 38.6% WER for a strong HMM baseline with a language model.
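The "mix everything" recipe can be sketched in a few lines: pool every utterance from every corpus into one training set, with no per-dataset sampling weights, so larger corpora naturally contribute proportionally more examples. This is a minimal illustrative sketch, not the paper's actual data pipeline; the dataset names and `mix_datasets` helper are hypothetical stand-ins.

```python
import random

def mix_datasets(datasets, seed=0):
    """Pool utterances from all corpora into one training set.

    No re-weighting or re-balancing: each utterance appears exactly
    once, so a large corpus contributes more examples than a small one.
    """
    pooled = [example for ds in datasets for example in ds]
    rng = random.Random(seed)
    rng.shuffle(pooled)  # shuffle so each batch mixes corpora
    return pooled

# Hypothetical stand-ins for the real corpora (utterance IDs only).
librispeech = [f"ls-{i}" for i in range(960)]  # largest corpus
wsj = [f"wsj-{i}" for i in range(80)]          # much smaller corpus
train_set = mix_datasets([librispeech, wsj])
```

Because there is no re-balancing step, the only design decision is the shuffle: it ensures minibatches draw from all corpora rather than seeing one dataset at a time.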
Related benchmarks
| Task | Dataset | WER (%) | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | 3.3 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | 1.7 | 833 |
| Automatic Speech Recognition | WSJ (92-eval) | 1.3 | 131 |
| Automatic Speech Recognition | SWITCHBOARD swbd | 4.8 | 39 |
| Automatic Speech Recognition | CHiME-6 (dev) | 26.2 | 20 |
| Automatic Speech Recognition | TED-LIUM (test) | 5.7 | 19 |
| Automatic Speech Recognition | CHiME-6 (eval) | 31.0 | 12 |
| Automatic Speech Recognition | AMI IHM | 9.5 | 10 |
| Automatic Speech Recognition | TED-LIUM | 5.3 | 9 |
| Automatic Speech Recognition | AMI SDM English (eval) | 22.7 | 8 |