Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation

About

Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a unified music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a unified content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. In particular, during the speech-singing joint training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance Vevo2's ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2's effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are available at https://versasinger.github.io/.

Xueyao Zhang, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, Zhizheng Wu • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS zh (test)	WER 0.0294	87
Text-to-Speech	SeedTTS en (test)	WER 3.639	21
Pitch Style Conversion	VocalSet and GTSinger	nMOS 3.937	18
Singing Voice Conversion	SVC English	WER 11.64	8
Zero-shot Text-to-Speech	Singing Voice	WER 7.66	8
Singing Voice Conversion	Chinese SVC	WER 14.53	8
Zero-shot Text-to-Speech	Expressive Speech	WER 11.48	8
Singing Voice Synthesis	Seen (test)	MOS-Q 3.85	8
Voice Conversion	SeedTTS VC English (test)	WER 3.53	8
Voice Conversion	SeedTTS VC Chinese (test)	WER 3.01	8

Showing 10 of 39 rows
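
Most rows above report WER (word error rate), the standard intelligibility metric: the word-level edit distance between a reference transcript and the recognized transcript of the generated audio, divided by the number of reference words. The sketch below is a minimal, generic illustration of that definition, not the evaluation code behind these results. The function name `word_error_rate` and the whitespace tokenization are assumptions made for the example; real evaluations also depend on the ASR model and text normalization used, and Chinese test sets are commonly scored per character rather than per whitespace-separated word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for illustration; tokens are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sat on mat" involves one deleted word out of six reference words, so the function returns 1/6. Multiplying by 100 gives the percentage form in which WER is often reported.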
