PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

About

Neural network applications generally benefit from larger models, but in current speech enhancement, larger-scale networks are often less robust to the variety of real-world use cases beyond those encountered in the training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that uses frequency-positional embeddings to build frequency-dependent features more efficiently in its early layers. A semi-supervised method increases the amount of conversational training data by pre-enhancing noisy datasets, which improves performance on real recordings. A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions of speech quality. Ablation experiments, together with objective and human opinion metrics, show the benefits of the proposed improvements.
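
To make the idea of frequency-positional embeddings concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the cosine form of the embedding, the function names `frequency_positional_embedding` and `append_embedding`, and the choice to concatenate the embedding to each bin's features are all assumptions made for illustration. The abstract only states that such embeddings help early layers build frequency-dependent features.

```python
import math


def frequency_positional_embedding(num_bins, num_channels):
    """Return one embedding vector per frequency bin.

    Assumed form (illustrative only): channel i of bin k is
    cos(2**i * pi * k / num_bins), so each bin gets a distinct pattern.
    """
    return [
        [math.cos((2 ** i) * math.pi * k / num_bins) for i in range(num_channels)]
        for k in range(num_bins)
    ]


def append_embedding(frame, embedding):
    """Concatenate each bin's features with that bin's embedding vector.

    `frame` is a list with one feature list per frequency bin,
    for example [real, imag] of an STFT coefficient.
    """
    if len(frame) != len(embedding):
        raise ValueError("frame and embedding must cover the same number of bins")
    return [list(feats) + list(emb) for feats, emb in zip(frame, embedding)]


# Hypothetical usage: 4 frequency bins, 2 input features, 3 embedding channels.
embedding = frequency_positional_embedding(num_bins=4, num_channels=3)
frame = [[0.1, 0.2] for _ in range(4)]
augmented = append_embedding(frame, embedding)
```

Because a convolution applies the same kernel at every frequency, extra input channels that encode the bin index are one way to let early layers behave differently at low and high frequencies. Whether the paper concatenates the embedding or combines it differently is not stated in the abstract.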

Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy • 2020

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Speech Enhancement | DNS with reverb (test) | -- | 18 |
| Speech Denoising | DNS no-reverb (test) | PESQ (WB) 2.745 | 16 |
| Speech Enhancement | DNS Challenge Without Reverb (test) | -- | 14 |
| Speech Enhancement | DNS Challenge 2020 (test) | DNSMOS Score 3.52 | 9 |
| Speech Enhancement | DNS Challenge 2020 | PESQ 2.75 | 8 |
| Speech Enhancement | DNS Challenge With Reverb 2020 (test) | WB-PESQ 2.832 | 7 |
| Speech Enhancement | DNS Challenge INTERSPEECH Without Reverb 2020 (test) | WB-PESQ 2.748 | 7 |
| Noise Suppression | DNS Challenge Synthetic without Reverb blind 2020 (test) | MOS 4.07 | 2 |
| Noise Suppression | DNS Challenge Synthetic with Reverb 2020 (test) | MOS 3.19 | 2 |
| Noise Suppression | DNS Challenge blind Real Recordings 2020 (test) | MOS 3.4 | 2 |
