
Robust Speech Activity Detection in the Presence of Singing Voice

About

Speech Activity Detection (SAD) systems often misclassify singing as speech, leading to degraded performance in applications such as dialogue enhancement and automatic speech recognition. We introduce Singing-Robust Speech Activity Detection (SR-SAD), a neural network designed to robustly detect speech in the presence of singing. Our key contributions are: i) a training strategy using controlled ratios of speech and singing samples to improve discrimination, ii) a computationally efficient model that maintains robust performance while reducing inference runtime, and iii) a new evaluation metric tailored to assess SAD robustness in mixed speech-singing scenarios. Experiments on a challenging dataset spanning multiple musical genres show that SR-SAD maintains high speech detection accuracy (AUC = 0.919) while rejecting singing. By explicitly learning to distinguish between speech and singing, SR-SAD enables more reliable SAD in mixed speech-singing scenarios.
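The abstract reports speech detection accuracy as an AUC value. As a rough illustration of how such a figure can be computed from frame-level detector outputs, the sketch below implements the standard rank-based (Mann-Whitney) form of ROC AUC. It is not the paper's evaluation code and it is not the new metric named in contribution iii). The function name `roc_auc`, the frame-level labeling, and the choice to count singing frames as the negative class are all assumptions made for this example.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    Assumed labeling (not taken from the paper): 1 = speech frame,
    0 = any non-speech frame, including singing.
    Tied scores receive the average of the ranks they span.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative frame")

    # Sort frame indices by score, then assign 1-based average ranks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic of the positive class, normalized by the number of pairs.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    # Hypothetical detector scores: two speech frames, two singing frames.
    # One singing frame (0.6) outscores one speech frame (0.4).
    frame_scores = [0.9, 0.4, 0.6, 0.1]
    frame_labels = [1, 1, 0, 0]
    print(roc_auc(frame_scores, frame_labels))  # 0.75
```

Under this labeling, a detector that assigns high speech scores to singing frames loses AUC even when it never misses real speech, which is the failure mode the abstract describes.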

Philipp Grundhuber, Mhd Modar Halimeh, Martin Strauß, Emanuël A. P. Habets • 2025

Related benchmarks

Task | Dataset | Result | Rank
Speech Activity Detection | MoisesDB (singing-only) | Accuracy: 0.9786 | 10
Speech Activity Detection | MoisesDB | RTF: 32 | 5
