# Unsupervised Speech Recognition
## About
Despite rapid progress in recent years, current speech recognition systems still require labeled training data, which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.
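The segmentation step described above can be illustrated with a minimal sketch: cluster the per-frame representations, place a segment boundary wherever the cluster assignment changes, and mean-pool the frames inside each segment. This is a simplified stand-in (toy 2-D "frames", plain k-means in pure Python), not the paper's actual implementation, which operates on wav2vec 2.0 features:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vecs):
    """Component-wise mean of a list of vectors."""
    n = len(vecs)
    return [sum(xs) / n for xs in zip(*vecs)]

def kmeans(frames, k, iters=20, seed=0):
    """Return a cluster id per frame via plain Lloyd's k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(frames, k)
    assign = [0] * len(frames)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(f, centroids[c]))
                  for f in frames]
        for c in range(k):
            members = [f for f, a in zip(frames, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    return assign

def segment(frames, k=2):
    """Boundaries where the cluster id changes; mean-pool each segment."""
    ids = kmeans(frames, k)
    segments, start = [], 0
    for i in range(1, len(ids) + 1):
        if i == len(ids) or ids[i] != ids[start]:
            segments.append(mean(frames[start:i]))
            start = i
    return segments

# Five toy frames forming three contiguous runs of two acoustic "states":
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1]]
print(segment(frames))  # three pooled segment representations
```

In wav2vec-U, the resulting segment representations are what the generator maps to phoneme distributions during adversarial training; the toy example here only shows how cluster-id changes induce segment boundaries.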
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Phoneme Recognition | TIMIT (test) | PER | 16.8 | 31 |
| Phoneme Recognition | TIMIT core (test) | PER | 17.8 | 20 |
| Unsupervised Automatic Speech Recognition | LibriSpeech 100 hours (dev-clean) | PER | 19.3 | 7 |
| Unsupervised Automatic Speech Recognition | LibriSpeech 100 hours (dev-other) | PER | 22.9 | 7 |
| Unsupervised Automatic Speech Recognition | LibriSpeech 100 hours (test-clean) | PER | 19.3 | 7 |
| Unsupervised Automatic Speech Recognition | LibriSpeech 100 hours (test-other) | PER | 23.2 | 7 |
| Phoneme Recognition | TIMIT core (dev) | PER | 17.1 | 6 |
| Speech Recognition | Multilingual LibriSpeech (MLS) (test) | WER (de) | 32.5 | 4 |