SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
About
We present SpeechStew, a speech recognition model trained on a combination of publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing. SpeechStew achieves SoTA or near-SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, significantly outperforming prior work that uses strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. Fine-tuning SpeechStew on CHiME-6, a noisy low-resource speech dataset, achieves 38.9% WER without a language model, compared to 38.6% WER for a strong HMM baseline with a language model.
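The "mix everything" recipe can be sketched in a few lines: pool every utterance from every corpus into one training set, with no per-dataset sampling weights, so larger corpora naturally contribute proportionally more examples. This is a minimal illustrative sketch, not the paper's actual data pipeline; the dataset names and `mix_datasets` helper are hypothetical stand-ins.

```python
import random

def mix_datasets(datasets, seed=0):
    """Pool utterances from all corpora into one training set.

    No re-weighting or re-balancing: each utterance appears exactly
    once, so a large corpus contributes more examples than a small one.
    """
    pooled = [example for ds in datasets for example in ds]
    rng = random.Random(seed)
    rng.shuffle(pooled)  # shuffle so each batch mixes corpora
    return pooled

# Hypothetical stand-ins for the real corpora (utterance IDs only).
librispeech = [f"ls-{i}" for i in range(960)]  # largest corpus
wsj = [f"wsj-{i}" for i in range(80)]          # much smaller corpus
train_set = mix_datasets([librispeech, wsj])
```

Because there is no re-balancing step, the only design decision is the shuffle: it ensures minibatches draw from all corpora rather than seeing one dataset at a time.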
Related benchmarks
| Task | Dataset | WER (%) | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | 3.3 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | 1.7 | 833 |
| Automatic Speech Recognition | WSJ (92-eval) | 1.3 | 131 |
| Automatic Speech Recognition | SWITCHBOARD swbd | 4.8 | 39 |
| Automatic Speech Recognition | CHiME-6 (dev) | 26.2 | 20 |
| Automatic Speech Recognition | TED-LIUM (test) | 5.7 | 19 |
| Automatic Speech Recognition | CHiME-6 (eval) | 31.0 | 12 |
| Automatic Speech Recognition | AMI IHM | 9.5 | 10 |
| Automatic Speech Recognition | TED-LIUM | 5.3 | 9 |
| Automatic Speech Recognition | AMI SDM English (eval) | 22.7 | 8 |