wav2vec: Unsupervised Pre-training for Speech Recognition

About

We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.
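The abstract describes the training objective only as a "noise contrastive binary classification task". As a rough illustration, the sketch below assumes the common form of such a loss: a logistic term that rewards a high score for the true future audio frame and a low score for distractor frames sampled from elsewhere. The function names, the `weight` parameter, and the averaging over distractors are assumptions made for this example and are not taken from the text above.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    # Inner product of two equal-length vectors given as lists.
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(context, positive, negatives, weight=1.0):
    """Binary-classification loss for one prediction step (illustrative).

    context   : vector predicted from past audio (hypothetical name)
    positive  : encoding of the true future frame
    negatives : non-empty list of distractor frame encodings
    weight    : scale on the distractor term (assumed hyperparameter)
    """
    # The true frame should be classified as "real": high score wanted.
    pos_term = log_sigmoid(dot(context, positive))
    # Distractors should be classified as "noise": low score wanted.
    neg_term = sum(log_sigmoid(-dot(context, n)) for n in negatives) / len(negatives)
    return -(pos_term + weight * neg_term)


ctx = [0.5, -1.0, 2.0]
flipped = [-v for v in ctx]
aligned = contrastive_loss(ctx, ctx, [flipped])      # prediction matches the true frame
misaligned = contrastive_loss(ctx, flipped, [ctx])   # prediction matches the distractor
print(aligned < misaligned)
```

The loss is lower when the context vector agrees with the true frame than when it agrees with a distractor, which is the signal that lets the network learn from unlabeled audio.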

Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli • 2019

Related benchmarks

Task	Dataset	Metric	Value	
Universal Speech Representation Evaluation	SUPERB Benchmark	Overall Score	63.5	60
Speech Recognition	WSJ nov93 (dev)	WER	5.1	52
Voice Classification	HC/PD/ALS Voice Cohort Cross-Cohort (External)	BalAcc	39.37	52
Voice Classification	HC/PD/ALS Voice Cohort (Internal)	Balanced Accuracy	0.4477	52
Speech Recognition	WSJ nov92 (test)	WER	2.43	34
Phoneme Recognition	TIMIT (test)	PER	14.7	33
Emotion Recognition	ER	Accuracy	59.8	33
Speaker Identification	SID	Accuracy	56.6	30
Speech Recognition	Wall Street Journal open vocabulary (dev93)	WER	5.1	28
Speech Emotion Recognition	IEMOCAP (five-fold/ten-fold cross-validation)	WA	59.79	25

Showing 10 of 14 rows
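Several rows above report word error rate (WER), and the abstract quotes a relative WER reduction. As a reminder of how these numbers are usually computed, here is a minimal sketch: WER is taken as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and the relative reduction is the drop in WER divided by the baseline WER. The function names and the example values are hypothetical and are not results from the paper.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical transcripts: one substitution and one deletion over six words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
# Hypothetical WER values chosen only to illustrate the arithmetic.
print(relative_reduction(10.0, 6.4))
```

Note that a relative reduction is measured against the baseline's own error rate, so it is not the same as a difference in absolute WER percentage points.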
