FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications

About

This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale high-quality TTS dataset with rich annotations and a wide coverage of content, speaking style, and timbre. Then, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens via a semantic-aware speech tokenizer, and can be generated by a language model from the prompt text and audio. Then, a two-stage waveform generator is proposed to decode them to the high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which can stably synthesize high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot way for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning with 1-hour recording. Moreover, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions via instruction tuning, to better serve spoken chatbots.

Hao-Han Guo, Yao Hu, Kun Liu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, Kun Xie, Kai-Tuo Xu • 2024

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS en (test)	WER 3.8	121
Text-to-Speech	Seed-TTS zh (test)	WER 0.0151	87
Text-to-Speech	LibriSpeech PC clean (test)	WER 2.69	46
Text-to-Speech	Seed-TTS Seed-EN (test)	WER 0.0382	32
Text-to-Speech	SeedTTS en (test)	WER 1.652	21
Text-to-Speech	Seed-TTS-Eval zh (test)	CER 1.51	21
Text-to-Speech	Chinese standard (test)	CER 1.51	21
Text-to-Speech	English (test)	WER 0.0382	21
Text-to-Speech	Seed-zh (test)	CER 1.14	17
Text-to-Speech	LibriSpeech clean PC (test)	WER (%) 2.69	17

Showing 10 of 23 rows
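
The rows above report word error rate (WER) and character error rate (CER) on what appear to be two different scales: some values look like percentages (for example 3.8) and others like fractions (for example 0.0382). That reading is an assumption, since the page does not state the units. As a reminder of what the metric measures, the following is a minimal sketch of WER as the word-level edit distance divided by the reference length. The function name `word_error_rate` and the example sentences are illustrative only and are not taken from the paper. In TTS evaluation the hypothesis is typically an ASR transcript of the synthesized audio; that step is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one word of a six-word reference is dropped.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"fraction: {wer:.4f}  percent: {100 * wer:.2f}")
```

CER follows the same recurrence over characters instead of whitespace-separated words, which is why it is the usual choice for Chinese test sets.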
