DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

About

Codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, standard codecs entangle timbre and prosody, which hinders independent control in continuation-based LMs. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework featuring a disentangled speech codec (DisCodec) and an LM-based generator. The core component DisCodec employs a two-stage design: 1) tri-factor disentanglement to separate speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) fusion and reconstruction that merges content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction to address the disentanglement-reconstruction trade-off. This allows the LM to perform prosodic continuation from a style prompt while the decoder injects target timbre, enabling flexible zero-shot control. Experiments demonstrate that DisCo-Speech achieves competitive voice cloning and superior zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis.
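The inference flow described above can be sketched as a toy data-flow example. This is only an illustration under stated assumptions, not the authors' implementation: every name here (`encode_content_prosody`, `lm_continue`, `decode`, `synthesize`, `TimbreEmbedding`) is hypothetical, and strings stand in for the learned tokens, embeddings, and waveforms. It shows only that prosody comes from the style prompt via LM continuation, while timbre is supplied separately at the decoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimbreEmbedding:
    # Hypothetical stand-in for a global timbre representation.
    speaker_id: str


def encode_content_prosody(utterance_id: str, n_tokens: int) -> List[str]:
    # Hypothetical stand-in for the codec tokenizer: it yields unified
    # content-prosody tokens that carry no timbre information.
    return [f"{utterance_id}:cp{i}" for i in range(n_tokens)]


def lm_continue(text: str, style_prompt_tokens: List[str]) -> List[str]:
    # Hypothetical stand-in for the LM: a real model would condition on the
    # style prompt tokens and continue them; this toy version emits one
    # placeholder token per word of the target text.
    return [f"gen({word})" for word in text.split()]


def decode(tokens: List[str], timbre: TimbreEmbedding) -> str:
    # Hypothetical stand-in for the decoder: timbre is injected here,
    # independently of the prosody carried by the tokens.
    return f"waveform[timbre={timbre.speaker_id}, tokens={len(tokens)}]"


def synthesize(text: str, style_prompt_id: str, timbre_ref_id: str) -> str:
    prompt_tokens = encode_content_prosody(style_prompt_id, n_tokens=4)
    generated = lm_continue(text, prompt_tokens)
    timbre = TimbreEmbedding(speaker_id=timbre_ref_id)
    return decode(generated, timbre)


result = synthesize("hello there world", "angry_clip", "speaker_B")
print(result)
```

In this sketch the style prompt and the timbre reference are separate inputs, so swapping `timbre_ref_id` changes the speaker identity without touching the token sequence the LM produced.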

Tao Li, Wenshuo Ge, Zhichao Wang, Zihao Cui, Yong Ma, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng • 2025

Related benchmarks

Task	Dataset	Result
Speech Reconstruction	Librispeech (test-clean)	UTMOS 4.1	64
Voice Cloning	Seed-TTS en (test)	WER 3.01	53
Voice Cloning	Seed-TTS-Eval zh (test)	CER 2.08	37
Controllable Speech Synthesis	Prosody-timbre Evaluation Set Emotion scenario	DisCo-Speech Preference 50.6	4
Controllable Speech Synthesis	Prosody-timbre Evaluation Set Style scenario	DisCo-Speech Preference 51.5	4
Voice Conversion	Voice Conversion (VC) Zero-shot	UTMOS 3.98	4
