W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

About

Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations via solving a masked prediction task consuming the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the unsupervised data. In particular, when compared to published models such as conformer-based wav2vec 2.0 and HuBERT, our model shows a 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec 2.0 by more than 30% relative.
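
The idea of solving the contrastive task and the MLM task simultaneously can be illustrated with a minimal sketch for a single masked frame. This is not the authors' implementation: the function names, the cosine-similarity scoring, the temperature of 0.1, and the equal loss weights are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def _cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss: pick the true quantized target among distractors.

    `temperature` is an illustrative assumption, not a value from the source.
    """
    candidates = [target] + distractors
    logits = [_cosine(context, c) / temperature for c in candidates]
    # Negative log-probability of the true target, which sits at index 0.
    return _log_sum_exp(logits) - logits[0]


def mlm_loss(token_logits, token_id):
    """Cross-entropy for predicting the discrete token ID of a masked frame."""
    return _log_sum_exp(token_logits) - token_logits[token_id]


def joint_loss(l_contrastive, l_mlm, w_contrastive=1.0, w_mlm=1.0):
    """Weighted sum of both losses; the weights here are hypothetical."""
    return w_contrastive * l_contrastive + w_mlm * l_mlm


# Toy example: the context vector matches the target and is orthogonal
# to the single distractor, so the contrastive loss is close to zero.
l_c = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Uniform logits over 4 tokens give a cross-entropy of log(4).
l_m = mlm_loss([0.0, 0.0, 0.0, 0.0], token_id=2)
total = joint_loss(l_c, l_m)
```

Because both terms are summed into one scalar, a single backward pass would update the quantizer and the context network together, which is the end-to-end property the abstract contrasts with HuBERT's iterative re-clustering.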

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 2.7 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 1.4 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER 2.6 | 411 |
| Automatic Speech Recognition | LibriSpeech 960h (test-other) | WER 2.5 | 81 |
| Speech Recognition | LibriSpeech (test) | WER 0.014 | 59 |
| Automatic Speech Recognition | LibriSpeech 960h (dev-other) | WER 2.4 | 50 |
| Speech Recognition | LibriSpeech 960hr (test) | WER 1.4 | 26 |
| Speech Recognition | LibriSpeech 960hr (dev) | WER 1.3 | 25 |
| Speech Recognition | LibriSpeech (dev) | WER 1.3 | 21 |
| Speech Recognition | Google Voice Search (test) | WER 6.2 | 4 |
Showing 10 of 13 rows
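
For readers unfamiliar with the metric listed above, the following is a minimal sketch of how word error rate (WER) and a relative WER reduction are commonly computed. It assumes whitespace tokenization and a non-empty reference; the function names and the example sentences are hypothetical, and the baseline value of 3.0 is made up for the arithmetic rather than taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumes whitespace tokenization and a non-empty reference transcript.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical baseline of 3.0 against a new WER of 2.7.
reduction = relative_wer_reduction(3.0, 2.7)
```

The function returns WER as a fraction; multiplying by 100 gives the percentage form in which WER is often reported. A relative reduction compares two systems by ratio, so it is independent of which of the two forms is used, as long as both inputs use the same one.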
