ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

About

Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference because of their large parameter counts. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-based zero-shot TTS model with a compact model size and fast inference speed. Key designs include: 1) a Zipformer-based vector field estimator to maintain adequate modeling capability under a constrained size; 2) an average upsampling-based initial speech-text alignment and a Zipformer-based text encoder to improve speech intelligibility; 3) a flow distillation method to reduce the number of sampling steps and eliminate the inference overhead associated with classifier-free guidance. Experiments on 100k hours of multilingual data show that ZipVoice matches state-of-the-art models in speech quality while being 3 times smaller and up to 30 times faster than a DiT-based flow-matching baseline. Code, model checkpoints, and demo samples are publicly available at https://github.com/k2-fsa/ZipVoice.
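
As a rough illustration of the average upsampling idea in design 2), the sketch below spreads text tokens evenly across a target number of speech frames, assuming every token has the same duration. The function name `average_upsample` and its arguments are hypothetical and are not taken from the ZipVoice code base; the released implementation may work on embeddings and handle padding differently.

```python
def average_upsample(tokens, num_frames):
    """Spread tokens evenly over num_frames frames.

    Hypothetical helper: assumes a uniform duration per token, which is
    only an initial alignment guess, not a learned one.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n = len(tokens)
    if num_frames < n:
        raise ValueError("need at least one frame per token")
    # Frame t maps to token floor(t * n / num_frames), so each token covers
    # either floor(num_frames / n) or ceil(num_frames / n) consecutive frames.
    return [tokens[t * n // num_frames] for t in range(num_frames)]


# Three tokens stretched over seven frames.
print(average_upsample(["h", "e", "y"], 7))
# ['h', 'h', 'h', 'e', 'e', 'y', 'y']
```

The result has the same length as the speech feature sequence, so it can be paired with it frame by frame as a starting alignment.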

Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, Daniel Povey • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS en (test)	WER 1.6	159
Text-to-Speech	Seed-TTS zh (test)	WER 0.014	87
Text-to-Speech	LibriSpeech PC clean (test)	WER 1.64	66
Voice Cloning	Seed-TTS en (test)	WER 1.64	53
Voice Cloning	Seed-TTS-Eval zh (test)	CER 1.4	37
Text-to-Speech	Seed-zh (test)	CER 1.4	32
Text-to-Speech	Seed-en (test)	WER 1.7	30
Text-to-Speech	Seed-TTS English (test)	WER 1.7	14
Text-to-Speech	Seed-TTS en 24 kHz (test)	SIM-o 0.697	11
Text-to-Speech	Seed-TTS 24 kHz (test-zh)	SIM-o 0.751	11

Showing 10 of 17 rows
