Generative Spoken Language Modeling from Raw Audio
About
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and we validate the proposed metrics with human evaluation. Across three speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
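The encoder-to-pseudo-text step can be sketched as follows. This is a toy illustration only: the `quantize` and `dedupe` helpers are assumptions standing in for the paper's actual pipeline (a CPC/wav2vec 2.0/HuBERT encoder followed by k-means quantization and run-length collapsing of repeated units), not its implementation.

```python
# Toy sketch of turning encoder feature frames into pseudo-text units.
# The codebook stands in for k-means centroids fit on encoder features;
# collapsing repeated units mirrors the deduplication used before
# feeding pseudo-text to the unit language model.

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest centroid."""
    units = []
    for f in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        units.append(dists.index(min(dists)))
    return units

def dedupe(units):
    """Collapse runs of repeated units into a single occurrence."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

# Tiny 2-D codebook with 3 "units" (real systems use 50, 100, or 200).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.1), (0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.9)]

units = quantize(frames, codebook)   # frame-level unit sequence
pseudo_text = dedupe(units)          # deduplicated pseudo-text
```

The deduplicated unit sequence is what the generative language model is trained on; the speech decoder then maps generated unit sequences back to a waveform.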
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 15.57 | 966 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 15.25 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 6.59 | 319 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 6.6 | 84 |
| Semantic understanding | Topic-StoryCloze S→S | Accuracy | 66.6 | 10 |
| Syntactic knowledge evaluation | sBLIMP ZeroResource Challenge 2021 (dev) | Success Rate | 57.1 | 9 |
| Zero-shot Speech Evaluation | sBLIMP | sBLIMP Score | 57.1 | 7 |
| Grammatical knowledge | sBLIMP Speech | Accuracy | 54.2 | 7 |
| Lexical knowledge | sWUGGY Speech (test) | Accuracy | 64.8 | 7 |
| Zero-shot Speech Evaluation | sWUGGY | sWUGGY In-Vocab Score | 68.7 | 7 |