Diffusion-based Unsupervised Audio-visual Speech Enhancement

About

This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach that combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. First, the diffusion model is pre-trained on clean speech, conditioned on the corresponding video data, to model the generative distribution of clean speech. This pre-trained model is then paired with the NMF-based noise model to estimate clean speech iteratively. Specifically, a diffusion-based posterior sampling approach is implemented within the reverse diffusion process, where, after each iteration, a speech estimate is obtained and used to update the noise parameters. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised generative AVSE method. Additionally, the new inference algorithm offers a better balance between inference speed and enhancement performance than the previous diffusion-based method. Code and demo are available at: https://jeaneudesayilo.github.io/fast_UdiffSE
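
To make the alternating inference concrete, here is a minimal, self-contained sketch of the kind of loop the abstract describes: a reverse-diffusion step under a pre-trained audio-visual prior, a posterior (likelihood) correction toward the noisy observation, and a multiplicative NMF update of the noise parameters. Everything in it (the toy `score_model`, the tensor shapes, the Wiener-style gain, the Euler step) is an illustrative assumption, not the authors' implementation.

```python
import torch

F_BINS, T_FRAMES = 257, 50   # STFT frequency bins x time frames (assumed)
K = 8                        # NMF rank (assumed)
N_STEPS = 30                 # number of reverse-diffusion iterations (assumed)

torch.manual_seed(0)

# Hypothetical stand-ins for the pre-trained components and the observation.
def score_model(x_real, video, t):
    """Toy audio-visual score network; the real one is learned on clean speech."""
    return -x_real  # pulls the sample toward zero, purely illustrative

video_feats = torch.zeros(T_FRAMES, 64)                      # placeholder visual features
noisy = torch.randn(F_BINS, T_FRAMES, dtype=torch.complex64) # observed noisy STFT

# NMF noise model: the noise power spectrogram is approximated by W @ H.
W = torch.rand(F_BINS, K) + 1e-3
H = torch.rand(K, T_FRAMES) + 1e-3

def nmf_update(W, H, psd, eps=1e-8):
    """One pair of standard multiplicative updates for Itakura-Saito NMF."""
    V = W @ H + eps
    H = H * ((W.T @ (psd / V**2)) / (W.T @ (1.0 / V) + eps))
    V = W @ H + eps
    W = W * (((psd / V**2) @ H.T) / ((1.0 / V) @ H.T + eps))
    return W, H

x = torch.randn(F_BINS, T_FRAMES, dtype=torch.complex64)  # start from pure noise

for i in reversed(range(N_STEPS)):
    t = torch.tensor((i + 1) / N_STEPS)
    # 1) Prior step: one reverse-diffusion update with the audio-visual model
    #    (a crude Euler step on a toy SDE; stands in for the real sampler).
    score = score_model(torch.view_as_real(x), video_feats, t)
    x = x + (t**2 / N_STEPS) * torch.view_as_complex(score.contiguous())
    # 2) Likelihood step: pull the sample toward the observation under the
    #    current NMF noise PSD (a Wiener-like gain, assumed here to use a
    #    unit speech variance for simplicity).
    gain = 1.0 / (1.0 + (W @ H))
    x = gain * noisy + (1.0 - gain) * x
    # 3) Noise update: refit the NMF parameters on the residual power.
    W, H = nmf_update(W, H, (noisy - x).abs() ** 2)

speech_estimate = x  # final clean-speech STFT estimate; invert the STFT for audio
```

The key design point the sketch tries to capture is the interleaving: because the noise parameters are refreshed after every reverse step, the likelihood correction stays calibrated as the speech estimate improves, which is what allows the method to remain unsupervised (no paired noisy/clean training data).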

Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel, Xavier Alameda-Pineda • 2024

Related benchmarks

Task               | Dataset                            | Metric | Score   | Rank
Speech Enhancement | VoiceBank + DEMAND (VB-DMD) (test) | PESQ   | 3.29    | 105
Speech Enhancement | WSJ0–QUT (test)                    | SI-SDR | 3.24 dB | 23
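
For context, the SI-SDR figure above follows the standard scale-invariant definition (Le Roux et al., 2019); a minimal NumPy reference implementation, offered here only as a sketch of the metric rather than the evaluation code used for this benchmark:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB; both signals are zero-meaned first."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the optimally scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    return float(10.0 * np.log10(np.dot(target, target) / (np.dot(distortion, distortion) + eps)))
```

Called as `si_sdr(enhanced_waveform, clean_waveform)`, it returns a value in dB; higher is better, and the scale invariance means a globally rescaled estimate is not penalized.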
