HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

About

Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method achieves quality similar to human speech while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with quality comparable to an autoregressive counterpart.
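To make the speed figures in the abstract concrete, the sketch below converts a "times faster than real-time" factor into samples generated per second and into wall-clock time for a clip. It assumes the factor means audio duration divided by synthesis time; the function names and the 10-second clip length are illustrative and do not come from the paper.

```python
SAMPLE_RATE_HZ = 22_050  # 22.05 kHz output rate, as stated in the abstract


def samples_per_second(speedup: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Audio samples produced per wall-clock second at a given real-time speedup.

    Assumes speedup = audio duration / synthesis time (hypothetical helper).
    """
    return speedup * sample_rate_hz


def synthesis_time_s(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of audio."""
    return audio_seconds / speedup


# Speedups reported in the abstract: V100 GPU and the small-footprint CPU model.
gpu_rate = samples_per_second(167.9)
cpu_rate = samples_per_second(13.4)

# Illustrative 10-second clip (the clip length is an assumption for the example).
gpu_time = synthesis_time_s(10.0, 167.9)
cpu_time = synthesis_time_s(10.0, 13.4)

print(f"GPU: {gpu_rate:,.0f} samples/s, 10 s clip in {gpu_time * 1000:.1f} ms")
print(f"CPU: {cpu_rate:,.0f} samples/s, 10 s clip in {cpu_time * 1000:.1f} ms")
```

Under this reading, the GPU figure corresponds to roughly 3.7 million samples per second, and the CPU figure to roughly 295 thousand.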

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae • 2020

Related benchmarks

Task	Dataset	Result
Text-to-Speech	LibriSpeech clean (test)	WER 2.17	97
Music Source Separation	MUSDB18 HQ (test)	SDR (Drums) 4.37	61
Audio Deepfake Detection	ITW In-the-Wild	EER 23.779	51
Audio Deepfake Detection	CodecFake	EER 39.616	50
Speech Synthesis	LJ Speech (test)	MOS 4.13	36
Audio Deepfake Detection	ASVspoof LA 2019 (eval)	EER 0.201	36
Speech Enhancement	Speech Enhancement (SE) Task (test)	PESQ 1.903	22
Speech Synthesis	LibriTTS (ID)	PESQ 3	20
Audio Generation	LibriTTS (dev)	M-STFT 1.3647	18
Neural Vocoding	LibriTTS (test)	PESQ 3.056	18

Showing 10 of 64 rows
