An Investigation of Incorporating Mamba for Speech Enhancement
About
This work investigates Mamba, a recently proposed attention-free, scalable state-space model (SSM), for the speech enhancement (SE) task. In particular, we use Mamba to build regression-based SE models (SEMamba) in several configurations: basic, advanced, causal, and non-causal. We also consider loss functions based on either signal-level distances or metric-oriented objectives. Experimental evidence shows that SEMamba attains a competitive PESQ of 3.55 on the VoiceBank-DEMAND dataset with the advanced, non-causal configuration. A new state-of-the-art PESQ of 3.69 is also reported when SEMamba is combined with Perceptual Contrast Stretching (PCS). Compared with equivalent Transformer-based SE solutions, a noticeable FLOPs reduction of up to ~12% is observed with the advanced non-causal configurations. Finally, SEMamba can be used as a pre-processing step before automatic speech recognition (ASR), showing competitive performance against recent SE solutions.
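To make the causal versus non-causal distinction concrete, the following is a minimal sketch of a scalar linear state-space recurrence scanned in one direction (causal) and in both directions (non-causal). It is not the SEMamba implementation: Mamba uses input-dependent (selective) parameters and a hardware-aware scan, whereas this sketch assumes fixed scalar parameters `a`, `b`, and `c`. The function names and the choice to sum the forward and backward outputs are assumptions made purely for illustration.

```python
def ssm_scan(x, a=0.9, b=1.0, c=1.0):
    """Causal scan of a scalar linear state-space model (illustrative only).

    h[t] = a * h[t-1] + b * x[t]
    y[t] = c * h[t]
    """
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t  # the state depends only on past and current input
        y.append(c * h)
    return y


def bidirectional_scan(x, a=0.9, b=1.0, c=1.0):
    """Non-causal variant: forward scan plus time-reversed backward scan.

    Summing the two directions is an assumption for this sketch; other
    combination rules (e.g. concatenation plus projection) are possible.
    """
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + r for f, r in zip(forward, backward)]


# A unit impulse in the middle of a short sequence.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
causal_out = ssm_scan(impulse, a=0.5)
noncausal_out = bidirectional_scan(impulse, a=0.5)

print(causal_out)     # [0.0, 0.0, 1.0, 0.5, 0.25]
print(noncausal_out)  # [0.25, 0.5, 2.0, 0.5, 0.25]
```

The causal output is zero before the impulse arrives, so it can run frame by frame. The bidirectional output also responds at earlier time steps, which requires access to future frames and therefore rules out streaming use.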
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Enhancement | VoiceBank + DEMAND (VB-DMD) (test) | PESQ | 3.69 | 105 |
| Speech Enhancement | VCTK+DEMAND (test) | WB-PESQ | 3.52 | 13 |
| Phase Retrieval | VoiceBank Corpus (test) | PESQ | 4.59 | 8 |
| Speech Denoising | VoiceBank+DEMAND (test) | PESQ | 3.564 | 7 |
| Speech Dereverberation | WSJ0+WHAMR! (test) | WB-PESQ | 3.577 | 5 |
| Composite Denoising and Dereverberation | WSJ0+WHAMR! (test) | WB-PESQ | 2.372 | 5 |
| Speech Denoising | WSJ0+WHAMR! (test) | WB-PESQ | 2.658 | 5 |
| Composite Denoising, Dereverberation, and Bandwidth Extension | WSJ0+WHAMR! (test) | WB-PESQ | 2.066 | 5 |
| Speech Bandwidth Extension | WSJ0+WHAMR! (test) | WB-PESQ | 3.305 | 5 |
| Speech Denoising | DNS Non-Reverberant 2020 (test) | PESQ | 2.44 | 5 |