
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

About

We present SpeechStew, a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing of the datasets. SpeechStew achieves SoTA or near SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which significantly outperforms prior work with strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy low-resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, which compares to 38.6% WER from a strong HMM baseline with a language model.
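The "simply mix" recipe described above amounts to pooling every corpus into one training set with no per-dataset weighting, so each utterance is drawn in proportion to its corpus size. A minimal sketch of that idea (the function name and toy corpora here are hypothetical, not from the paper):

```python
import random

def mix_datasets(datasets, seed=0):
    """Pool utterances from all corpora and shuffle; no re-weighting
    or re-balancing is applied -- larger corpora simply contribute
    more examples, as in the SpeechStew recipe."""
    pooled = [ex for corpus in datasets.values() for ex in corpus]
    random.Random(seed).shuffle(pooled)
    return pooled

# Toy stand-ins for the real corpora; in practice each entry would be
# an (audio path, transcript) pair from AMI, LibriSpeech, WSJ, etc.
corpora = {
    "librispeech": [("ls", i) for i in range(5)],
    "ami":         [("ami", i) for i in range(2)],
    "wsj":         [("wsj", i) for i in range(3)],
}

mixed = mix_datasets(corpora)
```

In a real pipeline the same effect is usually achieved by concatenating and shuffling the example streams rather than materializing one list, but the uniform-over-examples sampling is the same.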

William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi · 2021

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Automatic Speech Recognition | LibriSpeech (test-other) | WER | 3.3 | 966
Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.7 | 833
Speech Recognition | WSJ (92-eval) | WER | 1.3 | 131
Automatic Speech Recognition | SWITCHBOARD swbd | WER | 4.8 | 39
Automatic Speech Recognition | CHiME-6 (dev) | WER (%) | 26.2 | 20
Automatic Speech Recognition | TED-LIUM (test) | WER | 5.7 | 19
Automatic Speech Recognition | CHiME-6 (eval) | WER | 31 | 12
Automatic Speech Recognition | AMI IHM | WER | 9.5 | 10
Automatic Speech Recognition | TED-LIUM | WER | 5.3 | 9
Automatic Speech Recognition | AMI SDM English (eval) | WER | 22.7 | 8

(10 of 12 rows shown)
