SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

About

We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment to Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
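
The two masking steps of the policy described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `spec_augment`, its parameter names, the default mask widths, and the choice of zero as the fill value are all assumptions made for the example, and the warping step is left out entirely.

```python
import numpy as np


def spec_augment(features, max_freq_width=27, max_time_width=100,
                 num_freq_masks=1, num_time_masks=1, rng=None):
    """Return a copy of a (time, frequency) feature array with random
    blocks of frequency channels and time steps set to zero.

    All names and default values here are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(features, dtype=float, copy=True)
    num_steps, num_channels = out.shape

    # Frequency masking: zero a block of consecutive channels.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, min(max_freq_width, num_channels) + 1))
        start = int(rng.integers(0, num_channels - width + 1))
        out[:, start:start + width] = 0.0

    # Time masking: zero a block of consecutive time steps.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, min(max_time_width, num_steps) + 1))
        start = int(rng.integers(0, num_steps - width + 1))
        out[start:start + width, :] = 0.0

    return out


# Example: augment a dummy utterance of 200 frames and 80 filter bank channels.
dummy = np.ones((200, 80))
augmented = spec_augment(dummy, rng=np.random.default_rng(0))
```

Because the masks are drawn fresh on every call, such a function would typically be applied on the fly to each training example rather than used to build a fixed augmented dataset.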

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le • 2019

Related benchmarks

Task	Dataset	Metric	Result
Automatic Speech Recognition	LibriSpeech (test-other)	WER	5.8	1447
Automatic Speech Recognition	LibriSpeech clean (test)	WER	2.5	1410
Automatic Speech Recognition	LibriSpeech (dev-other)	WER	6.8	535
Automatic Speech Recognition	LibriSpeech (dev-clean)	WER (%)	2.8	376
Speech Recognition	Hub5'00 SWB (test)	WER	6.8	91
Automatic Speech Recognition	LibriSpeech 100h (test-clean)	WER	5.5	64
Imagined digit classification	MegNIST	Accuracy	71.3	28
Automatic Speech Recognition	LibriSpeech 100h clean (dev)	WER	5.3	25
Respiratory sound classification	AKGC417L (IND)	Overall Score	80.02	17
Respiratory sound classification	LittC2SE (OOD)	Specificity	97.92	11

