EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
About
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made substantial progress, they still struggle to control the emotional expression of the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control. Inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques, a phoneme-boost variant outputs phoneme tokens and audio tokens in parallel to enhance content consistency. In addition, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural-language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints, and demo samples are available at https://github.com/yanghaha0908/EmoVoice.
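The phoneme-boost idea described above — predicting phoneme tokens and audio tokens in parallel from a shared decoder state — can be sketched as two output heads over one hidden vector. This is a minimal, hypothetical illustration: the sizes, weight matrices, and function names below are illustrative assumptions, not the model's actual architecture or configuration.

```python
import random

random.seed(0)

# Illustrative sizes only (not the paper's actual configuration).
HIDDEN = 16      # decoder hidden size
N_PHONEMES = 40  # phoneme vocabulary size
N_AUDIO = 64     # audio codec vocabulary size

# Two independent projection heads over the shared hidden state.
W_phoneme = [[random.gauss(0, 1) for _ in range(N_PHONEMES)] for _ in range(HIDDEN)]
W_audio = [[random.gauss(0, 1) for _ in range(N_AUDIO)] for _ in range(HIDDEN)]

def project(h, W, n_out):
    """Linear projection of hidden vector h through weight matrix W."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(n_out)]

def decode_step(h):
    """One decoding step: predict a phoneme token and an audio token
    in parallel from the same hidden state."""
    ph_logits = project(h, W_phoneme, N_PHONEMES)
    au_logits = project(h, W_audio, N_AUDIO)
    return ph_logits.index(max(ph_logits)), au_logits.index(max(au_logits))

# One illustrative decoding step on a random hidden state.
h = [random.gauss(0, 1) for _ in range(HIDDEN)]
phoneme_id, audio_id = decode_step(h)
print(phoneme_id, audio_id)
```

Because both heads read the same state, the phoneme stream acts as an explicit content signal alongside the audio tokens, which is the intuition behind the content-consistency benefit claimed above.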
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Contradictory-style Generation | VccmDataset | MOS-SA: 4.23 | 168 |
| Emotion Transfer | VccmDataset (test) | Accuracy: 63.4 | 21 |
| Text-to-Speech | TextrolSpeech + EmoVoice-DB (test) | MOS-N: 3.694 | 6 |
| Contradictory-style Energy Control | Speech Synthesis (Evaluation Set) | Accuracy: 76.1 | 6 |
| Contradictory-style Pitch Control | Speech Synthesis (Evaluation Set) | Accuracy: 73.1 | 6 |