MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

About

Recent large-scale text-to-speech (TTS) systems are usually grouped into autoregressive and non-autoregressive systems. Autoregressive systems model duration implicitly but exhibit deficiencies in robustness and lack duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g., phones), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment supervision between text and speech, as well as for phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Audio samples are available at https://maskgct.github.io/. We release our code and model checkpoints at https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct.
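
The mask-and-predict inference described above can be illustrated with a toy sketch. This is not the released MaskGCT code: the random `toy_predictor`, the `MASK` id, the confidence-based commit rule, and the cosine unmasking schedule are all assumptions chosen for illustration (the schedule is borrowed from MaskGIT-style decoders). It only shows the general idea of filling a sequence of a specified length over a few parallel passes.

```python
import math
import random

MASK = -1  # hypothetical id marking a position that is still masked


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_and_predict(length, vocab_size=8, steps=4, seed=0):
    """Fill `length` masked positions over at most `steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # All positions are predicted at once (no left-to-right dependency).
        guesses = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this pass.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses; the rest are re-predicted next pass.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    print(mask_and_predict(length=10))
```

In a real system the predictor would be conditioned on the text and prompt (first stage) or on the semantic tokens (second stage), and the target length would be supplied by the caller, as the abstract notes.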

Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, Zhizheng Wu • 2024

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS en (test)	WER 2.03	159
Text-to-Speech	LibriSpeech clean (test)	WER 2.6	97
Text-to-Speech	Seed-TTS zh (test)	WER 0.0227	87
Text-to-Speech	LibriSpeech PC clean (test)	WER 2.26	66
Voice Cloning	Seed-TTS en (test)	WER 2.62	53
Text-to-Speech	Seed-ZH	CER 2.27	42
Text-to-Speech	Seed EN	WER 2.62	41
Text-to-Speech	Seed-TTS-Eval (test)	WER 2.57	40
Text-to-Speech	Seed-TTS (eval)	WER 2.62	39
Voice Cloning	Seed-TTS-Eval zh (test)	CER 2.27	37

Showing 10 of 73 rows
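
For readers unfamiliar with the WER and CER metrics in the table, the following is a minimal sketch of how such error rates are commonly computed: the edit distance between a reference transcript and a hypothesis (for TTS, typically an ASR transcript of the synthesized audio), divided by the reference length. The function names are illustrative, and the sketch assumes plain whitespace tokenization with no text normalization, which may differ from what each benchmark actually uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction (multiply by 100 for a percentage)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate as a fraction, ignoring spaces."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

CER is generally preferred for languages such as Chinese, where word boundaries are not marked by spaces.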

