
DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

About

Codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, standard codecs entangle timbre and prosody, which hinders independent control in continuation-based LMs. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework featuring a disentangled speech codec (DisCodec) and an LM-based generator. The core component DisCodec employs a two-stage design: 1) tri-factor disentanglement to separate speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) fusion and reconstruction that merges content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction to address the disentanglement-reconstruction trade-off. This allows the LM to perform prosodic continuation from a style prompt while the decoder injects target timbre, enabling flexible zero-shot control. Experiments demonstrate that DisCo-Speech achieves competitive voice cloning and superior zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis.
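
The inference flow described in the abstract (prosodic continuation from a style prompt, with timbre injected at the decoder) can be sketched as a toy example. This is not the authors' code: every name below (`Utterance`, `Token`, `encode_content_prosody`, `encode_timbre`, `lm_continue`, `decode`, `synthesize`) is a hypothetical stand-in, and the neural components are replaced by symbolic placeholders so that only the data flow is visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Symbolic stand-in for a waveform, described by its three factors."""
    content: str
    prosody: str
    timbre: str


@dataclass(frozen=True)
class Token:
    """Stand-in for a fused content-prosody token; it carries no timbre."""
    content: str
    prosody: str


def encode_content_prosody(utt: Utterance) -> list[Token]:
    # Hypothetical codec encoder: one token per word, timbre is dropped.
    return [Token(word, utt.prosody) for word in utt.content.split()]


def encode_timbre(utt: Utterance) -> str:
    # Hypothetical timbre encoder: keeps speaker identity only.
    return utt.timbre


def lm_continue(prompt_tokens: list[Token], text: str) -> list[Token]:
    # Toy "LM": continues the prompt by reusing the prosody of its last token.
    prosody = prompt_tokens[-1].prosody
    return [Token(word, prosody) for word in text.split()]


def decode(tokens: list[Token], timbre: str) -> Utterance:
    # Hypothetical decoder: the target timbre enters only at this step.
    content = " ".join(t.content for t in tokens)
    return Utterance(content, tokens[0].prosody, timbre)


def synthesize(text: str, style_prompt: Utterance,
               timbre_reference: Utterance) -> Utterance:
    prompt_tokens = encode_content_prosody(style_prompt)
    new_tokens = lm_continue(prompt_tokens, text)
    return decode(new_tokens, encode_timbre(timbre_reference))


style_prompt = Utterance("what a day", "excited", "speaker_a")
timbre_reference = Utterance("hello there", "neutral", "speaker_b")
result = synthesize("see you soon", style_prompt, timbre_reference)
```

In this sketch the prosody of `result` comes from `style_prompt` and the timbre from `timbre_reference`, which is the independent control the abstract attributes to disentangling the two at the codec level.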

Tao Li, Wenshuo Ge, Zhichao Wang, Zihao Cui, Yong Ma, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speech Reconstruction | LibriSpeech (test-clean) | STOI | 0.86 | 49 |
| Voice Cloning | Seed-TTS en (test) | WER | 3.01 | 8 |
| Voice Cloning | Seed-TTS-Eval zh (test) | CER | 2.08 | 8 |
| Controllable Speech Synthesis | Prosody-timbre Evaluation Set Emotion scenario | DisCo-Speech Preference | 50.6 | 4 |
| Controllable Speech Synthesis | Prosody-timbre Evaluation Set Style scenario | DisCo-Speech Preference | 51.5 | 4 |
| Voice Conversion | Voice Conversion (VC) Zero-shot | UTMOS | 3.98 | 4 |
