F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

About

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with a Diffusion Transformer (DiT). Without requiring complex designs such as a duration model, text encoder, or phoneme alignment, the text input is simply padded with filler tokens to the same length as the input speech, and denoising is then performed for speech generation, an approach originally proved feasible by E2 TTS. However, the original design of E2 TTS is hard to follow because of its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow steps can be easily applied to existing flow-matching-based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, a large improvement over state-of-the-art diffusion-based TTS models. Trained on a public 100K-hour multilingual dataset, F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and efficient speed control. We have released all code and checkpoints to promote community development at https://SWivid.github.io/F5-TTS/.
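The two concrete mechanics mentioned in the abstract, padding the text with filler tokens up to the speech length and reporting speed as a real-time factor (RTF), can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `<F>` filler symbol, the character-level tokenization, and the function names are hypothetical and are not taken from the released code. RTF is taken in its usual sense of synthesis time divided by the duration of the generated audio.

```python
FILLER = "<F>"  # hypothetical filler token; the actual vocabulary entry may differ


def pad_text_to_frames(text: str, num_frames: int, filler: str = FILLER) -> list[str]:
    """Split text into character tokens and right-pad with filler tokens
    so that the token sequence has exactly num_frames entries."""
    tokens = list(text)  # assumption: one token per character
    if len(tokens) > num_frames:
        raise ValueError("text has more tokens than speech frames")
    return tokens + [filler] * (num_frames - len(tokens))


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # A 2-character text aligned to 5 speech frames.
    print(pad_text_to_frames("hi", 5))
    # 1.5 s of compute for 10 s of audio gives an RTF of 0.15.
    print(real_time_factor(1.5, 10.0))
```

An RTF below 1 means audio is generated faster than it plays back; under this definition, the reported 0.15 would correspond to roughly 1.5 seconds of compute for 10 seconds of speech.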

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen • 2024

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS en (test)	WER 1.26	121
Text-to-Speech	LibriSpeech clean (test)	WER 2.42	88
Text-to-Speech	Seed-TTS zh (test)	WER 0.0153	87
Text-to-Speech	LibriSpeech PC clean (test)	WER 1.89	46
Text-to-Speech	Seed-TTS (eval)	WER 2	39
Text-to-Speech	Seed-TTS Seed-EN (test)	WER 0.0183	32
Text-to-Speech	Seed-TTS EN	WER 1.83	32
Text-to-Speech	Seed-TTS-Eval (test)	WER 2	32
Zero-shot Text-to-Speech	Seed-TTS en (test)	WER 1.89	25
Text-to-Speech	EmergentTTS (eval)	Overall WER 11.93	25

Showing 10 of 103 rows
