DiffWave: A Versatile Diffusion Model for Audio Synthesis
About
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive and converts a white noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43) while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models on the challenging unconditional generation task in terms of audio quality and sample diversity under various automatic and human evaluations.
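The synthesis procedure described above — starting from white noise and running a fixed number of reverse Markov-chain steps — can be sketched as a standard DDPM-style sampling loop. This is a minimal illustration, not the paper's implementation: `eps_model` stands in for DiffWave's trained noise-prediction network, and the linear beta schedule and function names are assumptions for the sketch.

```python
import numpy as np

def diffwave_sample(eps_model, length, T=50, rng=None):
    """Reverse diffusion sampling sketch: denoise white noise for T fixed steps.

    eps_model(x_t, t) is a hypothetical stand-in for the trained network
    that predicts the noise component of x_t at diffusion step t.
    """
    rng = np.random.default_rng(rng)
    beta = np.linspace(1e-4, 0.05, T)        # assumed linear noise schedule
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)

    x = rng.standard_normal(length)          # x_T ~ N(0, I): white noise
    for t in reversed(range(T)):             # constant number of steps
        z = rng.standard_normal(length) if t > 0 else 0.0
        eps = eps_model(x, t)
        # posterior mean of x_{t-1} given the predicted noise
        x = (x - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])
        x = x + np.sqrt(beta[t]) * z         # add scheduled noise except at t=0
    return x

# Dummy "model" predicting zero noise, just to exercise the loop.
waveform = diffwave_sample(lambda x, t: np.zeros_like(x), length=16000)
```

Because the number of steps `T` is fixed and each step processes the whole waveform at once, generation cost does not grow with sequence length per sample the way an autoregressive model's does.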
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Trajectory Generation | Chengdu | Density | 0.0145 | 11 |
| Trajectory Generation | Xi'an | Density | 2.13 | 11 |
| Source Separation | LibriSpeech 2Mix | SI-SDRi | 20.8 | 10 |
| Neural Vocoding | LJSpeech | MOS | 4.49 | 9 |
| Speech Separation | Libri-5Mix | SI-SDRi (dB) | 13 | 9 |
| Speech Separation | Libri-10Mix | SI-SDRi (dB) | 8.1 | 9 |
| Neural Vocoding | VCTK (unseen speakers) | MOS | 3.9 | 8 |
| Source Separation | WSJ0 3mix | SI-SDRi | 20.3 | 8 |
| Time Series Reconstruction | PTB-XL (test) | PRD | 28.23 | 8 |
| Reconstruction | SleepEDF | PRD | 41.58 | 8 |