Voting-based Pitch Estimation with Temporal and Frequential Alignment and Correlation Aware Selection
About
The voting method, an ensemble approach for fundamental frequency estimation, is empirically known for its robustness but lacks thorough investigation. This paper provides a principled analysis and improvement of this technique. First, we offer a theoretical basis for its effectiveness, explaining the error variance reduction for fundamental frequency estimation and invoking Condorcet's jury theorem for voiced/unvoiced detection accuracy. To address its practical limitations, we propose two key improvements: 1) a pre-voting alignment procedure to correct temporal and frequential biases among estimators, and 2) a greedy algorithm to select a compact yet effective subset of estimators based on error correlation. Experiments on a diverse dataset of speech, singing, and music show that our proposed method with alignment outperforms individual state-of-the-art estimators in clean conditions and maintains robust voiced/unvoiced detection in noisy environments.
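To make the voting idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes each estimator outputs one F0 value in Hz per frame, with 0 marking an unvoiced frame, and that the tracks are already on a common time grid. The specific rules here are illustrative assumptions, since the abstract does not specify them: a strict majority vote for voicing, and a median taken in the log-frequency domain for F0. The function name `vote_pitch` is hypothetical.

```python
import numpy as np


def vote_pitch(f0_tracks):
    """Combine per-estimator F0 tracks by voting (illustrative sketch).

    f0_tracks: array-like of shape (n_estimators, n_frames), in Hz,
               where 0 marks a frame the estimator considers unvoiced.
    Returns (f0, is_voiced), each of length n_frames.
    """
    f0 = np.asarray(f0_tracks, dtype=float)
    n_estimators, n_frames = f0.shape
    voiced = f0 > 0

    # Voiced/unvoiced decision: strict majority of estimators (assumed rule).
    is_voiced = voiced.sum(axis=0) * 2 > n_estimators

    # Work in log2 frequency so that octave errors are symmetric;
    # unvoiced votes are masked out with NaN.
    safe_f0 = np.where(voiced, f0, 1.0)
    log_f0 = np.where(voiced, np.log2(safe_f0), np.nan)

    out = np.zeros(n_frames)
    for t in np.flatnonzero(is_voiced):
        # Median over the estimators that voted "voiced" in this frame.
        out[t] = 2.0 ** np.nanmedian(log_f0[:, t])
    return out, is_voiced


if __name__ == "__main__":
    tracks = [
        [220.0, 0.0, 440.0],
        [221.0, 0.0, 0.0],
        [110.0, 330.0, 441.0],  # octave error in frame 0, false voicing in frame 1
    ]
    f0, is_voiced = vote_pitch(tracks)
    print(f0, is_voiced)
```

In this toy input, the octave error in the first frame and the isolated voiced vote in the second frame are both outvoted, which is the kind of robustness the abstract attributes to the ensemble. The pre-voting alignment and correlation-aware estimator selection proposed in the paper are not covered by this sketch.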
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Voiced/Unvoiced Detection | Speech | V/UV Recall | 94.21 | 50 |
| Fundamental Frequency Estimation | Speech, Singing Voice, and Music Clean | RPA (5 cents) | 0.2901 | 12 |
| Fundamental Frequency Estimation | Speech SNR 30 dB | RPA50 | 71.9 | 10 |
| Fundamental Frequency Estimation | Speech SNR ∞ | RPA50 | 76.78 | 10 |
| Fundamental Frequency Estimation | Speech SNR 10 dB | RPA50 | 61.5 | 10 |
| Fundamental Frequency Estimation | Speech SNR 20 dB | RPA50 | 60.4 | 10 |
| Fundamental Frequency Estimation | Speech SNR 0 dB | RPA50 | 42.27 | 10 |