Geneses: Unified Generative Speech Enhancement and Separation
About
Real-world audio recordings often contain multiple speakers and various degradations, which limit both the quantity and quality of speech data available for building state-of-the-art speech processing models. Although end-to-end approaches that concatenate speech enhancement (SE) and speech separation (SS) to obtain a clean speech signal for each speaker are promising, conventional SE-SS methods struggle with complex degradations beyond additive noise. To address this, we propose Geneses, a generative framework for unified, high-quality SE-SS. Geneses uses latent flow matching to estimate each speaker's clean speech features with a multi-modal diffusion Transformer conditioned on self-supervised learning representations of the noisy mixture. We evaluate Geneses on two-speaker mixtures from LibriTTS-R under two conditions: additive noise only and complex degradations. The results demonstrate that Geneses significantly outperforms a conventional mask-based SE-SS method across various objective metrics and is highly robust to complex degradations. Audio samples are available on our demo page.
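For readers unfamiliar with flow matching, the following is a minimal NumPy sketch of the general idea, not the Geneses implementation. It assumes a straight-line (linear interpolation) probability path between Gaussian noise and a clean latent, and it replaces the trained, conditioned Transformer with a hypothetical closed-form "oracle" velocity field so the example runs on its own. The names `fm_training_pair`, `euler_sample`, and `make_oracle_velocity`, and the `cond` argument standing in for the self-supervised features of the noisy mixture, are illustrative assumptions.

```python
import numpy as np


def fm_training_pair(x1, rng):
    """Build one flow-matching regression example for a clean latent x1.

    Assumes a straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to x1.
    """
    x0 = rng.standard_normal(x1.shape)   # sample from the noise (source) distribution
    t = rng.uniform()                    # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # point on the straight path at time t
    target_velocity = x1 - x0            # velocity a network would be trained to predict
    return xt, t, target_velocity


def euler_sample(velocity_fn, x0, cond, num_steps=16):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x


def make_oracle_velocity(x1):
    """Hypothetical stand-in for a trained model: exact velocity toward a known x1.

    A real system would predict the velocity from (x, t, cond) without access to x1.
    """
    def velocity(x, t, cond):
        return (x1 - x) / (1.0 - t)      # cond is unused in this toy field
    return velocity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean_latent = rng.standard_normal((2, 4))   # toy latent: 2 speakers x 4 dims
    noise = rng.standard_normal((2, 4))
    estimate = euler_sample(make_oracle_velocity(clean_latent), noise, cond=None)
    print(np.allclose(estimate, clean_latent))
```

In an actual system, the oracle would be replaced by a learned network that receives the conditioning features through `cond`, and the sampled latents would then be decoded into waveforms; both steps are outside the scope of this sketch.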
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Speech Restoration and Separation | CallFriend German multilingual | NISQA | 3.646 | 5 |
| Speech Restoration and Separation | CallFriend Japanese multilingual | NISQA | 3.728 | 5 |
| Speech Restoration and Separation | CallFriend Spanish multilingual | NISQA | 3.559 | 5 |
| Speech Restoration and Separation | CallFriend Mandarin multilingual | NISQA | 3.949 | 5 |
| Speech Restoration and Separation | CallFriend French multilingual | NISQA | 3.753 | 5 |
| Speech Separation | OpenDialog in-the-wild | NISQA | 3.809 | 5 |
| Speech Separation and Restoration | SWB (evaluation) | MOS | 3.482 | 4 |
| Speech Enhancement and Separation | LibriTTS-R Background Noise Only (test) | DNSMOS | 3.4 | 3 |
| Speech Enhancement and Separation | LibriTTS-R Complex Degradations (test) | DNSMOS | 3.39 | 3 |
| Speech separation and enhancement | LibriTTS-R Background Noise Only (test) | ESTOI | 0.75 | 2 |