Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

About

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

Hubert Siuzdak • 2023
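
The abstract's central idea, generating Fourier spectral coefficients and letting an inverse transform produce the waveform, can be illustrated with a small NumPy sketch. This is not the Vocos implementation: the helper names (`hann`, `stft`, `istft`, `coefficients_from_head`), the log-magnitude plus phase parameterisation, and the frame sizes are assumptions chosen for illustration, and a real analysis STFT stands in for the network's output.

```python
import numpy as np

def hann(n_fft):
    # Periodic Hann window (illustrative choice of analysis/synthesis window).
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)

def stft(x, n_fft, hop, window):
    # Slice the signal into overlapping windowed frames and take a real FFT.
    starts = range(0, len(x) - n_fft + 1, hop)
    frames = np.stack([x[s:s + n_fft] * window for s in starts])
    return np.fft.rfft(frames, axis=-1)  # shape: (n_frames, n_fft // 2 + 1)

def istft(spec, n_fft, hop, window):
    # Inverse FFT per frame, then windowed overlap-add with normalisation.
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    length = n_fft + hop * (frames.shape[0] - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for t, frame in enumerate(frames):
        s = t * hop
        out[s:s + n_fft] += frame
        norm[s:s + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

def coefficients_from_head(log_mag, phase):
    # Hypothetical output head: turn two real-valued maps into complex
    # spectral coefficients. Assumed parameterisation, for illustration only.
    return np.exp(log_mag) * (np.cos(phase) + 1j * np.sin(phase))

n_fft, hop = 64, 16
window = hann(n_fft)
rng = np.random.default_rng(0)
x = rng.standard_normal(n_fft + hop * 20)  # toy "audio" signal

# Stand-in for a network prediction: exact log-magnitude and phase of x.
spec = stft(x, n_fft, hop, window)
log_mag = np.log(np.abs(spec) + 1e-12)
phase = np.angle(spec)

# One inverse transform maps frame-rate coefficients back to samples,
# with no learned upsampling layers in between.
y = istft(coefficients_from_head(log_mag, phase), n_fft, hop, window)
```

Because the coefficients here come from the signal itself, `y` matches `x` everywhere except the very first sample, where the periodic Hann window is zero. In a vocoder the log-magnitude and phase would instead be predicted at frame rate, so the only upsampling step is the inverse transform.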

Related benchmarks

Task	Dataset	Result
Speech Reconstruction	LibriTTS clean (test)	PESQ 2.807	63
Speech Reconstruction	LibriTTS (test-other)	UTMOS 3.1956	44
Text-to-Speech	Seed-TTS Seed-EN (test)	WER 0.0209	32
Audio Reconstruction	LJSpeech	UTMOS 4.0332	26
Analysis-synthesis	Music Academic	FAD 0.017	24
Speech Enhancement	Speech Enhancement (SE) Task (test)	PESQ 2.06	22
Speech Synthesis	LibriTTS (ID)	PESQ 3.6266	20
Waveform Generation	MUSDB18 out-of-distribution vocal samples HQ (test)	M-STFT 1.0203	19
Neural Vocoding	LibriTTS (test)	PESQ 3.8362	18
Audio Generation	LibriTTS (dev)	M-STFT 0.858	18

Showing 10 of 35 rows
