SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis
About
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Singing Voice Synthesis | GMO-SVS Chinese (test) | WER | 6.5 | 7 |
| Singing Voice Synthesis | SoulX-Singer Chinese (eval) | WER | 6.9 | 7 |
| Cross-Lingual Singing Voice Synthesis | SoulX-Singer-Eval | WER | 0.11 | 5 |
| Singing Voice Synthesis | GMO-SVS English (test) | WER | 0.149 | 5 |
| Singing Voice Synthesis | SoulX-Singer English (eval) | WER | 0.129 | 5 |
| Singing Voice Editing | GMO-SVS Chinese (test) | WER | 0.089 | 4 |
| Singing Voice Editing | GMO-SVS English (test) | WER | 21.3 | 3 |
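
The benchmarks above report word error rate (WER). As a reading aid, here is a minimal sketch of the conventional definition: the token-level edit distance between a reference lyric and a transcript of the synthesized audio, divided by the reference length. The function name `word_error_rate` and whitespace tokenization are assumptions made for illustration; the page does not describe the evaluation pipeline, the ASR model used for transcription, or whether a given value is a fraction or a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; tokens are split on
    whitespace, which is an assumption and not the benchmark's protocol.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-token reference.
score = word_error_rate("a b c d", "a x c")
print(f"WER as a fraction: {score}, as a percentage: {100 * score}")
```

The same score can be reported as a fraction or as a percentage, so values on different scales in a leaderboard are not directly comparable without knowing which convention each entry uses.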