
Non-Autoregressive Neural Text-to-Speech

About

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings a 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build a fully parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as in previous work.
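The layer-by-layer attention refinement described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the use of NumPy, the number of layers, and the way each layer's context conditions the next layer's queries are all illustrative assumptions; the point is only that every decoder layer recomputes attention over the text encodings, so alignment can improve progressively rather than being fixed in one pass.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_attention_layers(text_keys, text_values, queries, num_layers=3):
    """Sketch (not the paper's exact architecture) of layer-by-layer
    attention refinement: each layer attends over the text encodings,
    and its context vector conditions the queries of the next layer,
    so the alignment is re-estimated and refined at every layer."""
    ctx = np.zeros_like(queries)          # no context before the first layer
    alignments = []
    scale = np.sqrt(text_keys.shape[-1])  # standard dot-product scaling
    for _ in range(num_layers):
        scores = (queries + ctx) @ text_keys.T / scale
        align = softmax(scores, axis=-1)  # one alignment per spectrogram frame
        ctx = align @ text_values         # refined context fed to next layer
        alignments.append(align)
    return ctx, alignments
```

In the actual model each layer is a convolutional block with its own learned projections; here the refinement loop reuses one set of keys and values purely to keep the sketch short.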

Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao • 2019

Related benchmarks

Task           | Dataset                      | Result           | Rank
Text-to-Speech | ParaNet 100 sentences (test) | Repeat Errors: 1 | 6
