
Generative Spoken Language Modeling from Raw Audio

About

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), along with a set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels, for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and we validate the proposed metrics with human evaluation. Across three speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
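The three-stage pipeline described above (encoder → pseudo-text units → language model → decoder) can be sketched with toy stand-ins. This is a hypothetical, heavily simplified illustration: the paper's actual system uses CPC/wav2vec 2.0/HuBERT features quantized with k-means, a Transformer language model, and a Tacotron2-style decoder, none of which appear here.

```python
# Toy sketch of the GSLM pipeline. All components are simplified
# stand-ins, not the paper's models.

def encode(waveform, codebook):
    """Quantize each 'frame' (here, a scalar) to the nearest codebook
    entry, yielding a sequence of discrete pseudo-text units."""
    return [min(range(len(codebook)), key=lambda k: abs(x - codebook[k]))
            for x in waveform]

def generate(units, bigrams, n_steps):
    """Continue a unit sequence with a toy deterministic bigram model
    (standing in for the generative language model)."""
    out = list(units)
    for _ in range(n_steps):
        out.append(bigrams.get(out[-1], out[-1]))
    return out

def decode(units, codebook):
    """Map pseudo-text units back to a (crude) waveform."""
    return [codebook[u] for u in units]

# Hypothetical values: the paper uses codebooks of 50, 100, or 200 units.
codebook = [-1.0, 0.0, 1.0]
bigrams = {0: 1, 1: 2, 2: 0}   # hypothetical learned unit transitions

units = encode([-0.9, 0.1, 0.8], codebook)   # -> [0, 1, 2]
continued = generate(units, bigrams, 2)      # -> [0, 1, 2, 0, 1]
wave = decode(continued, codebook)           # -> [-1.0, 0.0, 1.0, -1.0, 0.0]
```

The point of the sketch is the interface: once audio is mapped to discrete units, generation reduces to ordinary language modeling over a small vocabulary, and synthesis is a units-to-waveform mapping.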

Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 15.57 | 1151 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 15.25 | 462 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 6.59 | 340 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 6.6 | 84 |
| Semantic understanding | Topic-StoryCloze S→S | Accuracy | 66.6 | 10 |
| Syntactic knowledge evaluations | BLIMP ZeroResource Challenge 2021 (dev) | Success Rate | 57.1 | 9 |
| Zero-shot Speech Evaluations | BLIMP | sBLIMP Score | 57.1 | 7 |
| Grammatical knowledge | sBLIMP Speech | Accuracy | 54.2 | 7 |
| Lexical knowledge | sWUGGY Speech (test) | Accuracy | 64.8 | 7 |
| Zero-shot Speech Evaluations | WUGGY | sWUGGY In-Vocab Score | 68.7 | 7 |

Showing 10 of 14 rows.
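Several rows above report word error rate (WER), the standard ASR metric: the word-level edit distance (substitutions + deletions + insertions) between the hypothesis and the reference, divided by the reference length. A minimal self-contained implementation, with a hypothetical sentence pair for illustration:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all remaining ref words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all remaining hyp words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)

# Two deletions ("on", "the") against a 6-word reference: WER = 2/6.
score = wer("the cat sat on the mat", "the cat sat mat")
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why table entries such as 15.57 are percentages of the reference word count, not bounded accuracies.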
