AudioLM: a Language Modeling Approach to Audio Generation

About

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
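The hybrid tokenization described above can be illustrated with a toy sketch. It is not the authors' implementation: random arrays stand in for the outputs of the pre-trained masked language model and the neural codec encoder, nearest-centroid lookup stands in for the discretization of the model activations, and residual vector quantization is assumed for the codec codes. All names (`semantic_tokens`, `acoustic_tokens`, `nearest_code`), the codebook sizes, and the ordering of the final token sequence are hypothetical choices made for illustration.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """Return, for each row of `vectors`, the index of the closest codebook row."""
    # Squared Euclidean distance between every vector and every codebook entry.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def semantic_tokens(embeddings, centroids):
    """Coarse tokens: one centroid index per frame embedding (assumed scheme)."""
    return nearest_code(embeddings, centroids)


def acoustic_tokens(latents, codebooks):
    """Fine tokens via residual quantization (assumed scheme).

    Each stage quantizes whatever the previous stages left unexplained.
    """
    residual = latents.copy()
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        residual = residual - codebook[idx]
    return np.stack(tokens, axis=1), residual


# Random stand-ins for real model outputs (assumption: 50 frames, 8 dimensions).
rng = np.random.default_rng(0)
frames, dim = 50, 8
embeddings = rng.normal(size=(frames, dim))  # stand-in for masked-LM activations
latents = rng.normal(size=(frames, dim))     # stand-in for codec encoder output
centroids = rng.normal(size=(16, dim))       # hypothetical semantic codebook
codebooks = [rng.normal(size=(32, dim)) * 0.5 ** i for i in range(3)]

sem = semantic_tokens(embeddings, centroids)         # shape (50,)
aco, residual = acoustic_tokens(latents, codebooks)  # shape (50, 3)

# One possible layout for a language model: semantic tokens first, then the
# flattened acoustic tokens. This ordering is an assumption of the sketch.
sequence = np.concatenate([sem, aco.reshape(-1)])
```

In this sketch the semantic stream carries one token per frame, while the acoustic stream carries several, which mirrors the trade-off the abstract describes: a compact stream for long-term structure and a richer one for reconstruction quality.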

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour • 2022

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 5.8	1207
Automatic Speech Recognition	LibriSpeech Other	WER 12	123
Automatic Speech Recognition	LibriSpeech Clean	WER 9.5	107
Automatic Speech Recognition	Librispeech (test-clean)	WER 6	96
Text-to-Speech	Seed-TTS (eval)	WER 4.9	39
Text-to-Speech	LibriTTS clean (test)	WER 0.043	30
Speech Reconstruction	LibriSpeech clean (test)	WER 2.7	25
Speech Recognition	Switchboard	WER 15	20
Automatic Speech Recognition	Common Voice en 15	WER 17.6	16
Automatic Speech Recognition	VoxPopuli 1.0 (test)	Avg WER 15	14

Showing 10 of 23 rows
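
Every result listed above is a word error rate (WER). As a reminder of how that metric is conventionally computed, here is a minimal sketch: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical, and the table does not state whether its values are percentages or fractions, so the sketch returns a plain fraction.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Here `wer` is 2/6, roughly 0.33, or 33% if expressed as a percentage.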
