A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement

About

Designing a single model that can suppress a wide range of distortions and improve speech quality, i.e., universal speech enhancement (USE), is a promising goal. Compared with supervised predictive methods, diffusion-based generative models have shown greater potential because they can regenerate speech from degraded signals in which information is severely damaged. However, generative models may introduce artifacts in highly adverse conditions, and diffusion models often carry a heavy computational burden because inference requires many steps. To jointly exploit the strengths of prediction and generation while mitigating their respective weaknesses, we propose PGUSE, a universal speech enhancement model that combines predictive and generative modeling. The model consists of two branches: the predictive branch directly predicts clean samples from degraded signals, while the generative branch optimizes the denoising objective of diffusion models. We integrate the two through output fusion and a truncated diffusion scheme: the former directly combines the results of both branches, and the latter modifies the reverse diffusion process using initial estimates from the predictive branch. Extensive experiments on several datasets show that the proposed model outperforms state-of-the-art baselines, demonstrating that predictive and generative modeling are complementary and that combining them is beneficial.
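The abstract does not specify the networks, the noise schedule, or the fusion rule, so the following is only a toy sketch of the two integration ideas under stated assumptions. Every name here (`predictive_branch`, `toy_denoiser`, `truncated_reverse_diffusion`, `fuse_outputs`, the `sigmas` schedule, and the weight `alpha`) is hypothetical and not taken from the paper. The moving average and the linear "denoiser" merely stand in for trained models, and the deterministic variance-exploding update is one common sampler choice, not necessarily the one PGUSE uses.

```python
import random


def predictive_branch(noisy):
    """Toy stand-in for a predictive model: 3-tap moving average."""
    n = len(noisy)
    out = []
    for i in range(n):
        window = noisy[max(0, i - 1):min(n, i + 2)]
        out.append(sum(window) / len(window))
    return out


def toy_denoiser(x_t, cond, sigma):
    """Toy stand-in for a diffusion denoiser: estimate the clean signal by
    pulling x_t toward the conditioning signal, more strongly at high noise."""
    w = sigma / (1.0 + sigma)
    return [(1.0 - w) * a + w * c for a, c in zip(x_t, cond)]


def truncated_reverse_diffusion(initial, cond, sigmas, rng):
    """Run a shortened reverse process (assumed form).

    Instead of starting from pure noise at the largest noise level, start
    from the predictive estimate perturbed at a smaller level sigmas[0],
    then take deterministic steps down to zero noise.
    """
    x = [v + sigmas[0] * rng.gauss(0.0, 1.0) for v in initial]
    for sigma, sigma_next in zip(sigmas, sigmas[1:] + [0.0]):
        x0_hat = toy_denoiser(x, cond, sigma)
        ratio = sigma_next / sigma
        # Euler step of the variance-exploding probability-flow ODE.
        x = [x0 + ratio * (xt - x0) for xt, x0 in zip(x, x0_hat)]
    return x


def fuse_outputs(pred, gen, alpha=0.5):
    """Output fusion as a weighted average (the weighting is an assumption)."""
    return [alpha * p + (1.0 - alpha) * g for p, g in zip(pred, gen)]


rng = random.Random(0)
noisy = [0.0, 1.2, 0.8, 1.1, 0.1, -0.2]
pred = predictive_branch(noisy)
# Three reverse steps starting from a moderate noise level.
gen = truncated_reverse_diffusion(pred, noisy, [0.5, 0.3, 0.1], rng)
enhanced = fuse_outputs(pred, gen, alpha=0.5)
```

In this sketch the truncation is what keeps the step count small: the reverse process begins near the predictive estimate, so only a few low-noise steps are run. How many steps are needed and how the two outputs should be weighted are left open here.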

Jie Zhang, Haoyin Yan, Xiaofei Li • 2025

Related benchmarks

Task	Dataset	Result
Speech Enhancement	WSJ0 UNI	PESQ 2.95	15
Speech Denoising	VBDMD (test)	PESQ 3.11	12
Speech Super-resolution	VBDMD-SR (test)	PESQ 4.09	10
Speech Enhancement	DNS Challenge no-reverb	DNSMOS 3.333	9
Speech Enhancement	DNS Challenge HardSet	DNSMOS 3.251	8
Speech Enhancement	DNS Challenge GSR	DNSMOS 3.291	6
Speech Enhancement	VCTK GSR	DNSMOS 3.057	6

