vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
About
We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context-prediction task. The algorithm uses either a Gumbel-Softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community that require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
Alexei Baevski, Steffen Schneider, Michael Auli · 2019
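The paragraph above names two ways to quantize the dense features. A minimal NumPy sketch of both choices, assuming illustrative sizes (8 codes of dimension 4) and not reproducing the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_quantize(logits, codebook, tau=1.0, hard=True):
    """Select a codebook entry via Gumbel-Softmax sampling.

    logits:   (V,) unnormalized scores over V codebook entries
    codebook: (V, D) code vectors
    """
    # Perturb the logits with Gumbel noise, then take a temperature softmax.
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = np.exp((logits + g) / tau)
    y = y / y.sum()
    if hard:
        # Forward pass uses a one-hot argmax (true discretization); in a real
        # framework the softmax gradient would flow via the straight-through trick.
        one_hot = np.zeros_like(y)
        one_hot[np.argmax(y)] = 1.0
        y = one_hot
    return y @ codebook  # the selected (or softly mixed) code vector

def kmeans_quantize(z, codebook):
    """Online k-means variant: replace z with its nearest codebook entry."""
    idx = np.argmin(((codebook - z) ** 2).sum(axis=1))
    return codebook[idx]

# Hypothetical sizes: V=8 codes of dimension D=4.
codebook = rng.normal(size=(8, 4))
logits = rng.normal(size=8)
z_q = gumbel_softmax_quantize(logits, codebook)
print(z_q.shape)  # (4,)
```

With `hard=True`, both paths output an exact row of the codebook, which is what makes the representation discrete enough to feed NLP-style models such as BERT.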
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER (%) | 18.2 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER (%) | 6.2 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER (%) | 15.5 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 5.6 | 319 |
| Speech Recognition | WSJ (92-eval) | WER (%) | 8.57 | 131 |
| Speech Recognition | WSJ nov93 (dev) | WER (%) | 4.46 | 52 |
| Image Reconstruction | CelebA-HQ (test) | FID (Reconstruction) | 12.03 | 50 |
| Semantic Image Synthesis | ADE20K (val) | FID | 37.51 | 47 |
| Speech Recognition | WSJ nov92 (test) | WER (%) | 2.34 | 34 |
| Phoneme Recognition | TIMIT (test) | PER (%) | 11.4 | 31 |
*Showing 10 of 21 rows.*