ZipEnhancer: Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement

About

Unlike other sequence tasks, which model hidden-layer features with three axes, Dual-Path speech enhancement models in the time and time-frequency domains operate on hidden-layer features with four axes. These models are effective and use few parameters, but they are computationally demanding. We propose ZipEnhancer, a Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement, which incorporates time- and frequency-domain Down-Up sampling to reduce computational cost. We introduce the ZipformerBlock as the core block and propose Dual-Path DownSampleStacks that symmetrically scale down and scale up. We also introduce the ScaleAdam optimizer and the Eden learning rate scheduler to further improve performance. Our model achieves new state-of-the-art results on the DNS 2020 Challenge and Voicebank+DEMAND datasets, with perceptual evaluation of speech quality (PESQ) scores of 3.69 and 3.63, respectively, using 2.04M parameters and 62.41G FLOPs, outperforming other methods of similar complexity.
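The cost argument behind Down-Up sampling can be illustrated with a small NumPy sketch. This is not the authors' implementation: the `downsample` and `upsample` helpers, the average-pooling and repeat operations, the factor of 2, and the tensor shape are all assumptions chosen for illustration. It only shows that halving both the time and frequency axes of a four-axis feature leaves a quarter of the elements for the inner blocks to process, and that a symmetric up-sampling step restores the original shape.

```python
import numpy as np

def downsample(x, axis, factor=2):
    """Average-pool along `axis` (hypothetical stand-in for a learned downsampler)."""
    x = np.moveaxis(x, axis, -1)
    n = x.shape[-1]
    assert n % factor == 0, "axis length must be divisible by factor"
    # Group neighbouring frames/bins and average each group.
    x = x.reshape(*x.shape[:-1], n // factor, factor).mean(axis=-1)
    return np.moveaxis(x, -1, axis)

def upsample(x, axis, factor=2):
    """Repeat each element along `axis` (hypothetical stand-in for a learned upsampler)."""
    return np.repeat(x, factor, axis=axis)

# Four-axis hidden feature: (batch, channels, time, frequency). Sizes are illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 100, 64))

# Scale down along time, then frequency.
y = downsample(downsample(x, axis=2), axis=3)

# Scale back up symmetrically: frequency, then time.
z = upsample(upsample(y, axis=3), axis=2)

reduction = x.size // y.size  # how many times fewer elements the inner blocks see
print(y.shape, z.shape, reduction)
```

Any per-element work done at the reduced resolution scales with `y.size` rather than `x.size`; attention along an axis shrinks further because its cost grows faster than linearly in sequence length.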

Haoxu Wang, Biao Tian • 2025

Related benchmarks

Task	Dataset	Result
Speech Enhancement	VoiceBank-DEMAND (test)	PESQ 3.628	201
Speech Enhancement	DNS no_reverb (test)	PESQ 3.419	46
Speech Enhancement	DNS Challenge Real Recordings (test)	SIG Score 3.323	41
Speech Enhancement	DNS with reverb (test)	STOI 28.8	27
Speech Enhancement	DNS non-blind 2020 (test)	SI-SNR 22.22	12
Composite Denoising and Dereverberation	WSJ0+WHAMR! (test)	WB-PESQ 2.401	5
Composite Denoising, Dereverberation, and Bandwidth Extension	WSJ0+WHAMR! (test)	WB-PESQ 2.169	5
Speech Bandwidth Extension	WSJ0+WHAMR! (test)	WB-PESQ 3.486	5
Speech Denoising	WSJ0+WHAMR! (test)	WB-PESQ 2.717	5
Speech Dereverberation	WSJ0+WHAMR! (test)	WB-PESQ 3.501	5
