Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
About
Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.
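The central idea of the abstract, producing Fourier spectral coefficients and letting an inverse transform yield the waveform, can be illustrated with a minimal NumPy sketch. This is not the Vocos implementation: no network is involved, and the magnitude and phase arrays are obtained by analysing a test tone rather than predicted by a model. The helper names (`periodic_hann`, `stft`, `istft`) and the settings `n_fft=64`, `hop=16` are assumptions chosen for illustration only.

```python
import numpy as np


def periodic_hann(n_fft):
    # Periodic Hann window; with hop = n_fft // 4 the overlapped sum of
    # its square is constant away from the signal edges.
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)


def stft(signal, n_fft=64, hop=16):
    # Slice the signal into overlapping windowed frames and take a real FFT
    # of each frame. Result shape: (n_frames, n_fft // 2 + 1).
    window = periodic_hann(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def istft(coeffs, n_fft=64, hop=16):
    # Inverse real FFT per frame, then windowed overlap-add. No learned
    # upsampling is needed: the hop size fixes the output sample rate.
    window = periodic_hann(n_fft)
    frames = np.fft.irfft(coeffs, n=n_fft, axis=-1) * window
    length = n_fft + hop * (len(coeffs) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window ** 2
    # Normalise by the accumulated squared window, guarding against zeros.
    return out / np.maximum(norm, 1e-8)


# A 440 Hz test tone at an assumed 16 kHz sample rate.
signal = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000)
spec = stft(signal)

# Stand-ins for what a Fourier-based vocoder would predict per frame.
magnitude = np.abs(spec)
phase = np.angle(spec)

# Recombine into complex coefficients and invert to a waveform.
coeffs = magnitude * (np.cos(phase) + 1j * np.sin(phase))
reconstructed = istft(coeffs)
```

Because the phase here is taken from the analysis of the original signal, the reconstruction matches the input away from the edges. The difficulty the abstract refers to arises when the phase has to be predicted rather than copied.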
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Reconstruction | LibriTTS clean (test) | PESQ | 2.8069 | 50 |
| Speech Reconstruction | LibriTTS (test-other) | UTMOS | 3.1956 | 44 |
| Audio Reconstruction | LJSpeech | UTMOS | 4.0332 | 26 |
| Analysis-synthesis | Music Academic | FAD | 0.017 | 24 |
| Waveform Generation | MUSDB18 out-of-distribution vocal samples HQ (test) | M-STFT | 1.0203 | 19 |
| Audio Generation | LibriTTS (dev) | M-STFT | 0.858 | 18 |
| Speech Synthesis | LibriTTS (test) | MOS | 4.8577 | 17 |
| Text-to-Speech | LibriSpeech clean PC (test) | WER (%) | 2.32 | 17 |
| Analysis-synthesis | Audio Industrial | FAD | 0.018 | 12 |
| Analysis-synthesis | Music Industrial | FAD | 0.037 | 12 |