Robust Singing Voice Transcription Serves Synthesis
About
Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, lack the accuracy and robustness needed for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that captures coarse-grained note information while ensuring fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also establish a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experiments show that ROSVOT achieves state-of-the-art transcription accuracy on both clean and noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, confirming ROSVOT's suitability for practical application. Audio samples are available at https://rosvot.github.io.
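To make the described architecture concrete, the sketch below shows one plausible reading of it in PyTorch: frame-level features are pooled at several temporal scales to capture coarse note information, fused back at the frame rate for fine-grained segmentation, and a per-note attention decoder summarizes each note's frames into a single pitch prediction. This is a minimal illustrative sketch; the module names, hidden sizes, pooling scales, and pitch range are assumptions, not ROSVOT's actual implementation.

```python
# Minimal sketch of a multi-scale AST front end with an attention-based
# pitch decoder. All dimensions and design choices here are illustrative
# assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleEncoder(nn.Module):
    """Extracts frame features and fuses coarser, downsampled views of them."""

    def __init__(self, in_dim: int = 80, hidden: int = 256, scales=(1, 2, 4)):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.scales = scales
        self.branches = nn.ModuleList(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1) for _ in scales
        )
        self.fuse = nn.Conv1d(hidden * len(scales), hidden, kernel_size=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: [B, T, in_dim] -> x: [B, hidden, T]
        x = self.proj(mel.transpose(1, 2))
        outs = []
        for scale, branch in zip(self.scales, self.branches):
            h = F.avg_pool1d(x, scale) if scale > 1 else x  # coarser view
            h = branch(h)
            if scale > 1:  # restore the frame rate before fusion
                h = F.interpolate(h, size=x.shape[-1], mode="linear")
            outs.append(h)
        # Fused features serve both frame-level segmentation heads and
        # the note-level pitch decoder: [B, T, hidden]
        return self.fuse(torch.cat(outs, dim=1)).transpose(1, 2)


class AttentionPitchDecoder(nn.Module):
    """Predicts one pitch per note by attending over that note's frames."""

    def __init__(self, hidden: int = 256, n_pitches: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, hidden))
        self.out = nn.Linear(hidden, n_pitches)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [N_notes, T_seg, hidden]; a learned query summarizes each note
        q = self.query.expand(frames.size(0), -1, -1)
        summary, _ = self.attn(q, frames, frames)
        return self.out(summary.squeeze(1))  # [N_notes, n_pitches] logits


if __name__ == "__main__":
    enc = MultiScaleEncoder()
    dec = AttentionPitchDecoder()
    frames = enc(torch.randn(2, 400, 80))   # [2, 400, 256] frame features
    segments = torch.randn(5, 40, 256)      # 5 hypothetical note segments
    print(frames.shape, dec(segments).shape)  # pitch logits: [5, 128]
```

In this reading, a frame-level head on the encoder output would predict note boundaries, and the frames between predicted boundaries would be gathered into segments for the pitch decoder, decoupling segmentation robustness from pitch accuracy.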
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Note-level Singing Voice Transcription | Mandarin datasets M4Singer and D1 (clean) | COn (F): 94 | 10 |
| Note-level Singing Voice Transcription | Mandarin datasets M4Singer and D1 (noisy) | COn (F): 93.8 | 10 |
| Note Transcription and Alignment | Multilingual singing dataset, Chinese and English (test) | COnPOff (F): 70.2 | 3 |
| Singing Voice Transcription | MIR-ST500 | COn: 72.1 | 2 |
| Singing Voice Transcription | TONAS | COn: 55.7 | 2 |

COn and COnPOff denote the standard note-transcription metrics for correct onset and correct onset-pitch-offset, respectively; (F) indicates F-measure.