ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

About

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality. Audio samples are available on our web page.
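The two ingredients named in the abstract, a domain label that separates real from synthetic speech and oversampling of the scarce real recordings, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `REAL`, `SYNTH`, `build_epoch`, and `condition`, the `real_fraction` parameter, and the choice to add the domain embedding to a conditioning vector. In an actual model the domain table would be a learned embedding layer.

```python
import random

# Hypothetical domain IDs: 0 marks real recordings, 1 marks ZS-TTS output.
REAL, SYNTH = 0, 1


def build_epoch(real_utts, synth_utts, real_fraction=0.5, seed=0):
    """Return shuffled (utterance, domain_id) pairs for one epoch.

    Real utterances are repeated (oversampled) until they make up roughly
    `real_fraction` of the epoch. The ratio is an assumed hyperparameter.
    """
    if not real_utts:
        raise ValueError("at least one real utterance is required")
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must lie strictly between 0 and 1")

    rng = random.Random(seed)
    # Solve n_real / (n_real + n_synth) = real_fraction for n_real.
    target = round(len(synth_utts) * real_fraction / (1.0 - real_fraction))
    n_real = max(len(real_utts), target)

    # Cycle through the small real set to reach the target count.
    real_items = [(real_utts[i % len(real_utts)], REAL) for i in range(n_real)]
    synth_items = [(utt, SYNTH) for utt in synth_utts]

    epoch = real_items + synth_items
    rng.shuffle(epoch)
    return epoch


def condition(cond_vec, domain_id, domain_table):
    """Add the domain embedding for `domain_id` to a conditioning vector.

    Additive injection is an assumption; the source only states that a
    lightweight domain embedding distinguishes the two kinds of speech.
    """
    return [c + d for c, d in zip(cond_vec, domain_table[domain_id])]


if __name__ == "__main__":
    epoch = build_epoch(["real_1", "real_2"], [f"syn_{i}" for i in range(10)])
    n_real = sum(1 for _, dom in epoch if dom == REAL)
    print(f"{n_real} real items out of {len(epoch)}")
```

With two real utterances and ten synthetic ones, `real_fraction=0.5` repeats each real utterance five times so that both domains contribute equally to the epoch. Which domain ID to feed at synthesis time is a design decision this sketch leaves open.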

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Text-to-Speech	LibriTTS zero-shot	--	--	14
Zero-shot Text-to-Speech	LibriTTS (test)	SECS	0.765	12
Zero-shot Text-to-Speech	YoBind (test)	SECS (s)	0.764	12
Zero-shot Text-to-Speech	YoBind	MOS (Naturalness)	3.58	5
