Less is More: Data Curation Matters in Scaling Speech Enhancement

About

The vast majority of modern speech enhancement systems rely on data-driven neural network models. Conventionally, larger datasets are presumed to yield superior model performance, an observation empirically validated across numerous tasks in other domains. However, recent studies reveal diminishing returns when scaling speech enhancement data. We focus on a critical factor: prevalent quality issues in "clean" training labels within large-scale datasets. This work re-examines this phenomenon and demonstrates that, within large-scale training sets, prioritizing high-quality training data is more important than merely expanding the data volume. Experimental findings suggest that models trained on a carefully curated subset of 700 hours can outperform models trained on the 2,500-hour full dataset. This outcome highlights the crucial role of data curation in scaling speech enhancement systems effectively.
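As a rough illustration of what curating a subset under an hour budget could look like, the sketch below ranks utterances by a per-utterance quality score and keeps the best ones until the budget is filled. It is not the authors' pipeline: the `Utterance` record, its `quality` field, the `curate` function, and the greedy ranking rule are all assumptions made for illustration, since the abstract does not say how the 700-hour subset was selected.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    duration_s: float
    # Hypothetical quality score for the "clean" label (higher is better).
    # The abstract does not specify which score, if any, was used.
    quality: float


def curate(utterances, budget_hours):
    """Greedily keep the highest-quality utterances within an hour budget."""
    budget_s = budget_hours * 3600.0
    selected, total_s = [], 0.0
    for utt in sorted(utterances, key=lambda u: u.quality, reverse=True):
        if total_s + utt.duration_s > budget_s:
            continue  # skip clips that would overshoot the budget
        selected.append(utt)
        total_s += utt.duration_s
    return selected, total_s / 3600.0


# Toy pool: four clips with made-up durations and scores.
pool = [
    Utterance("a", 1800.0, 3.9),
    Utterance("b", 3600.0, 2.1),
    Utterance("c", 1800.0, 3.5),
    Utterance("d", 1800.0, 1.2),
]
subset, hours = curate(pool, budget_hours=1.0)
print([u.utt_id for u in subset], hours)
```

On the toy pool, the two highest-scoring half-hour clips fill the one-hour budget and the lower-scoring clips are left out. A fixed score threshold would be an equally plausible rule; the budget form is used here only because the abstract states subset sizes in hours.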

Chenda Li, Wangyou Zhang, Wei Wang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Yihui Fu, Marvin Sach, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian • 2025

Related benchmarks

Task	Dataset	Result
Speech Enhancement	DNS No-Reverb 1 (test)	DNSMOS 3.34	19
Speech Enhancement	DNS1 With-Reverb (test)	DNSMOS 2.54	19
Speech Enhancement	Librispeech simulated general-SNR (test)	DNSMOS 3.2	11
Speech Enhancement	Librispeech simulated low-SNR (test)	DNSMOS 3.16	11
Universal Speech Enhancement	URGENT Challenge 2026 (val)	DNSMOS 3.19	6
