
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

About

Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with a target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a chain-of-thought (CoT) control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, to the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.

Xiaosu Su, Zihan Sun, Peilei Jia, Jun Gao • 2026

Related benchmarks

Task                           Dataset                                                 Result              Rank
Text-to-Speech                 InstructTTSEval ZH                                      APS 84.1            24
Single-utterance Voice Design  Human Evaluation set for single-utterance voice design  Overall Score 4.24  5
Dialogue voice design          40 multi-turn dialogues                                 SIM 80.8            2
