
WaveFlow: A Compact Flow-based Model for Raw Audio

About

In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveforms with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates speech with the same high fidelity as WaveNet, while synthesizing several orders of magnitude faster, as it requires only a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15× smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.
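The throughput figures quoted above are related by simple arithmetic: the synthesis rate is the audio sample rate multiplied by the real-time factor. The following is a minimal sketch of that relationship, not code from the paper; the function names `synthesis_rate_khz` and `seconds_to_generate` are hypothetical and chosen only for illustration.

```python
def synthesis_rate_khz(sample_rate_khz: float, realtime_factor: float) -> float:
    """Samples generated per wall-clock second, expressed in kHz."""
    return sample_rate_khz * realtime_factor


def seconds_to_generate(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of audio."""
    return audio_seconds / realtime_factor


# Figures taken from the abstract: 22.05 kHz audio, 42.6x faster than real-time.
rate = synthesis_rate_khz(22.05, 42.6)
print(f"{rate:.1f} kHz")  # 939.3 kHz, matching the rate quoted in the abstract

# Illustrative only: time to synthesize a 10-second clip at that speed.
print(f"{seconds_to_generate(10.0, 42.6):.3f} s")
```

At this speed, a clip takes well under one second of compute for every ten seconds of audio, which is what "faster than real-time" means in practice.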

Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song • 2019

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Synthesis | LJ Speech (test) | MOS 4.43 | 36 |
| Audio Generation | LJ Speech (test) | LL Score 5.101 | 20 |
| Audio Synthesis | LJSpeech (unseen) | MAE 0.3674 | 10 |
| Neural Vocoding | LibriTTS clean (dev) | MAE 0.2839 | 10 |
| Neural Vocoding | VCTK 100 audio clips (unseen) | MAE 0.2982 | 10 |
| Vocoding | LibriTTS (dev-other) | MAE 0.3359 | 10 |
| Universal Neural Vocoding | LibriTTS clean and other (dev) | M-STFT 1.112 | 6 |
