Generative Spoken Language Modeling from Raw Audio
About
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and we validate the proposed metrics with human evaluation. Across three speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
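The encoder-to-pseudo-text step can be sketched as follows. This is a toy illustration only: the `quantize` and `dedupe` helpers are assumptions standing in for the paper's actual pipeline (a CPC/wav2vec 2.0/HuBERT encoder followed by k-means quantization and run-length collapsing of repeated units), not its implementation.

```python
# Toy sketch of turning encoder feature frames into pseudo-text units.
# The codebook stands in for k-means centroids fit on encoder features;
# collapsing repeated units mirrors the deduplication used before
# feeding pseudo-text to the unit language model.

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest centroid."""
    units = []
    for f in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        units.append(dists.index(min(dists)))
    return units

def dedupe(units):
    """Collapse runs of repeated units into a single occurrence."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

# Tiny 2-D codebook with 3 "units" (real systems use 50, 100, or 200).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.1), (0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.9)]

units = quantize(frames, codebook)   # frame-level unit sequence
pseudo_text = dedupe(units)          # deduplicated pseudo-text
```

The deduplicated unit sequence is what the generative language model is trained on; the speech decoder then maps generated unit sequences back to a waveform.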
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 15.57 | 966 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 15.25 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 6.59 | 319 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 6.6 | 84 |
| Semantic understanding | Topic-StoryCloze S→S | Accuracy | 66.6 | 10 |
| Syntactic knowledge evaluation | sBLIMP ZeroResource Challenge 2021 (dev) | Success Rate | 57.1 | 9 |
| Zero-shot Speech Evaluation | sBLIMP | sBLIMP Score | 57.1 | 7 |
| Grammatical knowledge | sBLIMP Speech | Accuracy | 54.2 | 7 |
| Lexical knowledge | sWUGGY Speech (test) | Accuracy | 64.8 | 7 |
| Zero-shot Speech Evaluation | sWUGGY | sWUGGY In-Vocab Score | 68.7 | 7 |