HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks

About

Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion. This paper introduces HiFi-GAN, a deep learning method to transform recorded speech to sound as though it had been recorded in a studio. We use an end-to-end feed-forward WaveNet architecture, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain. It relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech. The proposed model generalizes well to new speakers, new speech content, and new environments. It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.

Jiaqi Su, Zeyu Jin, Adam Finkelstein • 2020
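
The abstract names a deep feature matching loss computed from the discriminators but does not spell it out. As a rough illustration only, the idea can be sketched as an L1 distance between the discriminator's intermediate activations for the clean and the enhanced signal, averaged over layers. The function names, the equal per-layer weighting, and the toy activation values below are assumptions made for this sketch, not the authors' implementation.

```python
from typing import List, Sequence


def l1_mean(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(
    feats_clean: List[Sequence[float]],
    feats_enhanced: List[Sequence[float]],
) -> float:
    """Average the per-layer L1 distance between discriminator activations.

    Each list entry is one (flattened) layer's activations. Equal weighting
    across layers is an assumption of this sketch.
    """
    if len(feats_clean) != len(feats_enhanced):
        raise ValueError("both inputs must cover the same layers")
    per_layer = [l1_mean(c, e) for c, e in zip(feats_clean, feats_enhanced)]
    return sum(per_layer) / len(per_layer)


# Toy activations from two hypothetical discriminator layers (flattened).
clean = [[0.0, 1.0, 2.0], [1.0, 1.0]]
enhanced = [[0.0, 1.5, 1.0], [1.0, 0.0]]
loss = feature_matching_loss(clean, enhanced)
```

In a real training loop the activations would come from the time-domain and time-frequency discriminators mentioned in the abstract, and this term would be added to the adversarial loss. The weighting between the two is not stated on this page.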

Related benchmarks

Task	Dataset	Result
Speech Enhancement	VoiceBank + DEMAND (VB-DMD) (test)	PESQ 2.94	114
Analysis-synthesis	Music Academic	FAD 0.044	24
Analysis-synthesis	Audio Industrial	FAD 0.037	12
Analysis-synthesis	Music Industrial	FAD 0.085	12
Singing Voice Synthesis	Singing Voice Industrial setting	MOS Prediction 3.93	11
Singing Voice Synthesis	Singing Voice Academic setting	MOS Prediction Score 3.84	11
Speech Synthesis	Speech Industrial Setting	MOS Prediction 4.11	11
Speech Synthesis	Speech Academic Setting	MOS Prediction 3.29	11
Speech Denoising	VCTK-DEMAND (test)	PESQ 2.94	8
