
Generative Spoken Language Modeling from Raw Audio

About

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
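The three-stage pipeline described above (discrete speech encoder → language model over pseudo-text units → speech decoder) can be sketched in miniature. Everything below is a toy stand-in, not the paper's actual models: the "encoder" is a framewise amplitude feature, quantization is nearest-centroid lookup, and collapsing runs of repeated units mirrors the deduplication applied to pseudo-text before language modeling.

```python
# Toy sketch of the GSLM encoder -> pseudo-text stage (assumed interfaces,
# not the paper's CPC/wav2vec 2.0/HuBERT encoders or k-means checkpoints).
import numpy as np

def encode(waveform: np.ndarray, frame: int = 160) -> np.ndarray:
    """Toy frame-level 'features': mean absolute amplitude per frame."""
    n = len(waveform) // frame
    frames = waveform[: n * frame].reshape(n, frame)
    return np.abs(frames).mean(axis=1, keepdims=True)

def quantize(features: np.ndarray, centroids: np.ndarray) -> list:
    """Map each frame to its nearest centroid index (a pseudo-text unit)."""
    d = np.abs(features - centroids.T)  # (n_frames, n_units) distances
    return d.argmin(axis=1).tolist()

def deduplicate(units: list) -> list:
    """Collapse consecutive repeats, as done before LM training."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

rng = np.random.default_rng(0)
wav = rng.standard_normal(16000)                # 1 s of fake 16 kHz audio
centroids = np.linspace(0.0, 2.0, 50)[:, None]  # 50 units, one of the paper's sizes
units = deduplicate(quantize(encode(wav), centroids))
```

In the real system, a generative language model is then trained on such unit sequences, and a decoder (a unit-to-speech vocoder) maps sampled units back to a waveform.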

Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 15.57 | 966 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 15.25 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 6.59 | 319 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 6.6 | 84 |
| Semantic understanding | Topic-StoryCloze S→S | Accuracy | 66.6 | 10 |
| Syntactic knowledge evaluations | BLIMP ZeroResource Challenge 2021 (dev) | Success Rate | 57.1 | 9 |
| Zero-shot Speech Evaluations | BLIMP | sBLIMP Score | 57.1 | 7 |
| Grammatical knowledge | sBLIMP Speech | Accuracy | 54.2 | 7 |
| Lexical knowledge | sWUGGY Speech (test) | Accuracy | 64.8 | 7 |
| Zero-shot Speech Evaluations | WUGGY | sWUGGY In-Vocab Score | 68.7 | 7 |

Showing 10 of 14 rows.
