From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

About

Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate signals at relatively low sampling rates. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.
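To make "low-bitrate discrete representations" and "at equal bit rate" concrete, the following minimal sketch computes the bitrate of a stream of discrete audio tokens. The function name `token_bitrate_bps` and the settings used (75 token frames per second, 1024 entries per codebook, 2 to 8 parallel codebooks) are illustrative assumptions typical of a 24 kHz neural audio codec; they are not stated on this page.

```python
import math


def token_bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second carried by a stream of discrete audio tokens.

    Each frame holds one token per codebook, and each token needs
    log2(codebook_size) bits to identify its codebook entry.
    """
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * n_codebooks * bits_per_token


# Assumed, illustrative settings (not taken from this page):
# 75 token frames per second, 1024 entries per codebook.
for n_codebooks in (2, 4, 8):
    kbps = token_bitrate_bps(75, n_codebooks, 1024) / 1000
    print(f"{n_codebooks} codebooks -> {kbps:.1f} kbps")
```

Under these assumptions the token stream carries 1.5, 3.0, or 6.0 kbps, compared with 384 kbps for uncompressed 16-bit mono audio at 24 kHz. Comparing decoders at equal bit rate means each decoder receives a token stream of the same size and differs only in how it turns those tokens back into a waveform.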

Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez • 2023

Related benchmarks

Task	Dataset	Result
Audio generation from Encodec tokens	universal evaluation dataset	PESQ 2.094	27
Audio Compression Quality Assessment	Audio 24kHz	Speech Quality Score 84.68	12
Audio Reconstruction	Various (test)	PESQ 2.488	11
Audio Synthesis	LJSpeech (test)	GPU Execution Time 4.82	6
Audio Reconstruction	Audio Evaluation Set 4 modalities, 150 samples per category	ViSQOL 3.67	6
Audio Generation (Average)	Bark Average official suno-ai implementation (test)	MUSHRA Score 73.86	2
Singing Voice Generation	Bark Singing Voices official suno-ai implementation (test)	MUSHRA Score 73.67	2
Text-to-Music	MusicGen Music open source version (test)	MUSHRA Score 74.97	2
Text-to-Speech	Bark official suno-ai implementation (test)	MUSHRA Score 76.04	2

