FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching

About

Generative models have excelled in audio tasks using approaches such as language models, diffusion, and flow matching. However, existing generative approaches for speech enhancement (SE) face notable challenges: language model-based methods suffer from quantization loss, which compromises speaker similarity and intelligibility, while diffusion models require complex training and incur high inference latency. To address these challenges, we propose FlowSE, a flow-matching-based model for SE. Flow matching learns a continuous transformation between noisy and clean speech distributions in a single pass, significantly reducing inference latency while maintaining high-quality reconstruction. Specifically, FlowSE trains on noisy mel spectrograms and optional character sequences, optimizing a conditional flow matching loss with ground-truth mel spectrograms as supervision. It implicitly learns the temporal-spectral structure of speech and the alignment between text and speech. During inference, FlowSE can operate with or without textual information, achieving impressive results in both scenarios, with further improvements when transcripts are available. Extensive experiments demonstrate that FlowSE significantly outperforms state-of-the-art generative methods, establishing a new paradigm for generative SE and demonstrating the potential of flow matching to advance the field. Our code, pre-trained checkpoints, and audio samples are available.

Ziqian Wang, Zikai Liu, Xinfa Zhu, Yike Zhu, Mingshuai Liu, Jun Chen, Longshuai Xiao, Chao Weng, Lei Xie • 2025
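
For readers unfamiliar with the conditional flow matching loss mentioned in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the commonly used straight-line probability path from a Gaussian sample x0 to the clean target x1, which the abstract does not specify. Mel spectrograms are flattened to short lists of floats, `cond` stands in for the noisy mel spectrogram and optional text, and `velocity_fn` and `make_oracle_velocity` are hypothetical names: the oracle is a closed-form stand-in for a trained network, used only to check that the pieces fit together.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Return the point on the straight path from x0 to x1 at time t,
    together with the constant target velocity x1 - x0 (assumed path)."""
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return xt, target_v


def cfm_loss(velocity_fn, x1, cond, rng):
    """Single-sample conditional flow matching loss: mean squared error
    between the predicted velocity and the target velocity."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # assumed Gaussian starting sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    xt, target_v = cfm_training_pair(x0, x1, t)
    pred_v = velocity_fn(xt, t, cond)
    return sum((p - v) ** 2 for p, v in zip(pred_v, target_v)) / len(x1)


def euler_sample(velocity_fn, x0, cond, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1
    with fixed-step Euler updates."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(x1):
    """Hypothetical stand-in for a trained network: the exact velocity
    that moves any point xt toward the known target x1 by time 1."""
    def velocity_fn(xt, t, cond):
        return [(b - x) / (1.0 - t) for x, b in zip(xt, x1)]
    return velocity_fn
```

In practice `velocity_fn` would be a neural network trained by minimizing `cfm_loss` over many (noisy, clean) pairs, and `euler_sample` would be run at inference from a fresh noise sample with the noisy recording as `cond`. The number of Euler steps trades speed for accuracy; the value 8 here is arbitrary.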

Related benchmarks

Task	Dataset	Metric	Result
Speech Enhancement	DNS Challenge Real Recordings (test)	SIG Score	3.643	41
Speech Enhancement	DNS no-reverb 2020 (test)	Signal Score (SIG)	3.685	30
Speech Enhancement	DNS Challenge With Reverb (test)	SIG	3.614	24
Speech Enhancement	DNS No-Reverb 1 (test)	DNSMOS	3.38	19
Speech Enhancement	DNS1 With-Reverb (test)	DNSMOS	3.34	19
Speech Enhancement	DNS Challenge no-reverb	DNSMOS	3.265	9
Speech Enhancement	Simulated (test)	DNSMOS	3.28	8
Speech Enhancement	DNS Challenge HardSet	DNSMOS	2.94	8
Automatic Speech Recognition	LibriSpeech noisy (test)	WER	0.3553	5
Speech Enhancement	LibriSpeech noisy (test)	SIG Score	3.539	5

