vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
About
We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context-prediction task. The algorithm uses either a Gumbel-Softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community that require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
Alexei Baevski, Steffen Schneider, Michael Auli · 2019
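The paragraph above names two ways to quantize the dense features. A minimal NumPy sketch of both choices, assuming illustrative sizes (8 codes of dimension 4) and not reproducing the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_quantize(logits, codebook, tau=1.0, hard=True):
    """Select a codebook entry via Gumbel-Softmax sampling.

    logits:   (V,) unnormalized scores over V codebook entries
    codebook: (V, D) code vectors
    """
    # Perturb the logits with Gumbel noise, then take a temperature softmax.
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = np.exp((logits + g) / tau)
    y = y / y.sum()
    if hard:
        # Forward pass uses a one-hot argmax (true discretization); in a real
        # framework the softmax gradient would flow via the straight-through trick.
        one_hot = np.zeros_like(y)
        one_hot[np.argmax(y)] = 1.0
        y = one_hot
    return y @ codebook  # the selected (or softly mixed) code vector

def kmeans_quantize(z, codebook):
    """Online k-means variant: replace z with its nearest codebook entry."""
    idx = np.argmin(((codebook - z) ** 2).sum(axis=1))
    return codebook[idx]

# Hypothetical sizes: V=8 codes of dimension D=4.
codebook = rng.normal(size=(8, 4))
logits = rng.normal(size=8)
z_q = gumbel_softmax_quantize(logits, codebook)
print(z_q.shape)  # (4,)
```

With `hard=True`, both paths output an exact row of the codebook, which is what makes the representation discrete enough to feed NLP-style models such as BERT.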
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER (%) | 18.2 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER (%) | 6.2 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER (%) | 15.5 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 5.6 | 319 |
| Speech Recognition | WSJ (92-eval) | WER (%) | 8.57 | 131 |
| Speech Recognition | WSJ nov93 (dev) | WER (%) | 4.46 | 52 |
| Image Reconstruction | CelebA-HQ (test) | FID (Reconstruction) | 12.03 | 50 |
| Semantic Image Synthesis | ADE20K (val) | FID | 37.51 | 47 |
| Speech Recognition | WSJ nov92 (test) | WER (%) | 2.34 | 34 |
| Phoneme Recognition | TIMIT (test) | PER (%) | 11.4 | 31 |
*Showing 10 of 21 rows.*