WaveFlow: A Compact Flow-based Model for Raw Audio
About
In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is trained directly with maximum likelihood. It handles the long-range structure of 1-D waveforms with a dilated 2-D convolutional architecture, while modeling local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates speech as high-fidelity as WaveNet, while synthesizing several orders of magnitude faster, since it requires only a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it significantly narrows the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, 15$\times$ fewer than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6$\times$ faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.
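The key idea is that a long 1-D waveform is squeezed into a 2-D array of height $h$, over which a dilated 2-D convolutional flow runs autoregressively along the short height dimension, so synthesis needs on the order of $h$ sequential steps rather than one per sample. A minimal sketch of such a squeeze operation (the helper name, `h` value, and column-major layout here are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def squeeze_waveform(x, h):
    """Reshape a 1-D waveform of length T into an h x (T // h) 2-D array.

    Adjacent samples land in the same column, so an autoregressive pass
    over the height dimension takes ~h sequential steps instead of T.
    Illustrative sketch only; layout details are an assumption.
    """
    T = len(x)
    assert T % h == 0, "waveform length must be divisible by h"
    return x.reshape(T // h, h).T

waveform = np.arange(16, dtype=np.float32)  # toy waveform, T = 16
X = squeeze_waveform(waveform, h=8)
print(X.shape)  # (8, 2)
```

With a small $h$, the number of sequential steps per flow stays small even for waveforms with hundreds of thousands of time-steps, which is what enables the fast synthesis reported above.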
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Synthesis | LJ Speech (test) | MOS 4.43 | 36 |
| Audio Generation | LJ Speech (test) | LL Score 5.101 | 20 |
| Audio Synthesis | LJ Speech (unseen) | MAE 0.3674 | 10 |
| Neural Vocoding | LibriTTS clean (dev) | MAE 0.2839 | 10 |
| Neural Vocoding | VCTK 100 audio clips (unseen) | MAE 0.2982 | 10 |
| Vocoding | LibriTTS (dev-other) | MAE 0.3359 | 10 |
| Universal Neural Vocoding | LibriTTS clean and other (dev) | M-STFT 1.112 | 6 |