EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
About
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made substantial progress, they still struggle to control the emotional expression of the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control. Inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques, a phoneme-boost variant outputs phoneme tokens and audio tokens in parallel to enhance content consistency. In addition, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural-language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints, and demo samples are available at https://github.com/yanghaha0908/EmoVoice.
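The phoneme-boost idea described above — predicting phoneme tokens and audio tokens in parallel from a shared decoder state — can be sketched as two output heads over one hidden vector. This is a minimal, hypothetical illustration: the sizes, weight matrices, and function names below are illustrative assumptions, not the model's actual architecture or configuration.

```python
import random

random.seed(0)

# Illustrative sizes only (not the paper's actual configuration).
HIDDEN = 16      # decoder hidden size
N_PHONEMES = 40  # phoneme vocabulary size
N_AUDIO = 64     # audio codec vocabulary size

# Two independent projection heads over the shared hidden state.
W_phoneme = [[random.gauss(0, 1) for _ in range(N_PHONEMES)] for _ in range(HIDDEN)]
W_audio = [[random.gauss(0, 1) for _ in range(N_AUDIO)] for _ in range(HIDDEN)]

def project(h, W, n_out):
    """Linear projection of hidden vector h through weight matrix W."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(n_out)]

def decode_step(h):
    """One decoding step: predict a phoneme token and an audio token
    in parallel from the same hidden state."""
    ph_logits = project(h, W_phoneme, N_PHONEMES)
    au_logits = project(h, W_audio, N_AUDIO)
    return ph_logits.index(max(ph_logits)), au_logits.index(max(au_logits))

# One illustrative decoding step on a random hidden state.
h = [random.gauss(0, 1) for _ in range(HIDDEN)]
phoneme_id, audio_id = decode_step(h)
print(phoneme_id, audio_id)
```

Because both heads read the same state, the phoneme stream acts as an explicit content signal alongside the audio tokens, which is the intuition behind the content-consistency benefit claimed above.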
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Contradictory-style Generation | VccmDataset | MOS-SA: 4.23 | 168 |
| Emotion Transfer | VccmDataset (test) | Accuracy: 63.4 | 21 |
| Text-to-Speech | TextrolSpeech + EmoVoice-DB (test) | MOS-N: 3.694 | 6 |
| Contradictory-style Energy Control | Speech Synthesis (Evaluation Set) | Accuracy: 76.1 | 6 |
| Contradictory-style Pitch Control | Speech Synthesis (Evaluation Set) | Accuracy: 73.1 | 6 |