Robust Speech Activity Detection in the Presence of Singing Voice

About

Speech Activity Detection (SAD) systems often misclassify singing as speech, leading to degraded performance in applications such as dialogue enhancement and automatic speech recognition. We introduce Singing-Robust Speech Activity Detection (SR-SAD), a neural network designed to robustly detect speech in the presence of singing. Our key contributions are: i) a training strategy using controlled ratios of speech and singing samples to improve discrimination, ii) a computationally efficient model that maintains robust performance while reducing inference runtime, and iii) a new evaluation metric tailored to assess SAD robustness in mixed speech-singing scenarios. Experiments on a challenging dataset spanning multiple musical genres show that SR-SAD maintains high speech detection accuracy (AUC = 0.919) while rejecting singing. By explicitly learning to distinguish between speech and singing, SR-SAD enables more reliable SAD in mixed speech-singing scenarios.
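The first contribution, training with controlled ratios of speech and singing samples, can be illustrated with a small batch-composition sketch. This is not the authors' implementation: the function name compose_batch, the singing_ratio parameter, the sampling with replacement, and the choice to label singing as non-speech (0) are all assumptions made for illustration only.

```python
import random


def compose_batch(speech_items, singing_items, batch_size, singing_ratio, seed=None):
    """Draw a training batch with a fixed share of singing examples.

    Hypothetical helper: speech items get label 1 and singing items get
    label 0 (assumed here so that singing acts as a negative example).
    """
    if not 0.0 <= singing_ratio <= 1.0:
        raise ValueError("singing_ratio must lie in [0, 1]")

    rng = random.Random(seed)

    # Fix the number of singing examples from the requested ratio;
    # the remainder of the batch is filled with speech examples.
    n_singing = round(batch_size * singing_ratio)
    n_speech = batch_size - n_singing

    # Sample with replacement so small pools can still fill a batch.
    batch = [(item, 1) for item in rng.choices(speech_items, k=n_speech)]
    batch += [(item, 0) for item in rng.choices(singing_items, k=n_singing)]

    # Shuffle so the two classes are interleaved within the batch.
    rng.shuffle(batch)
    return batch
```

Under these assumptions, compose_batch(speech, singing, batch_size=8, singing_ratio=0.25) yields six speech and two singing examples per batch; sweeping singing_ratio is one way to study how the share of singing in training affects discrimination.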

Philipp Grundhuber, Mhd Modar Halimeh, Martin Strauß, Emanuël A. P. Habets • 2025

Related benchmarks

Task	Dataset	Result	Rank
Speech Activity Detection	MoisesDB (singing-only)	Accuracy 0.9786	10
Speech Activity Detection	MoisesDB	RTF 32	5
