Efficient Emotion and Speaker Adaptation in LLM-Based TTS via Characteristic-Specific Partial Fine-Tuning
About
While LLM-based TTS models exhibit zero-shot emotion and speaker cloning, their cloning fidelity and pronunciation clarity degrade on unseen domains. Fine-tuning is essential for adaptation, yet uniformly tuning all parameters overlooks how individual layers contribute; on limited data it also trains slowly and causes catastrophic forgetting, degrading pronunciation accuracy. To address this, we propose CSP-FT, a characteristic-specific partial fine-tuning strategy. By dynamically analyzing layer contributions via a weighted sum, we selectively fine-tune only the two layers capturing the most and least emotion and speaker information, maximizing the utility of the former while explicitly strengthening the capacity of the latter. Experiments on a combined corpus of 11 datasets show that CSP-FT matches or exceeds the fidelity and intelligibility of full fine-tuning while updating only ~8% of parameters, accelerating training by ~2x, and significantly mitigating catastrophic forgetting.
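The core selection step can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the helper names, the probe-accuracy inputs, and the equal weighting between the emotion and speaker characteristics are all assumptions for the example.

```python
# Hypothetical sketch of CSP-FT's layer selection: score each layer by a
# weighted sum over per-characteristic contributions, then mark only the
# highest- and lowest-scoring layers as trainable.

def contribution_scores(per_layer_contribs, weights):
    """Weighted sum of per-layer contribution measures (e.g. for the
    emotion and speaker characteristics), one score per layer."""
    return [
        sum(w * c for w, c in zip(weights, contribs))
        for contribs in per_layer_contribs
    ]

def select_layers_to_tune(scores):
    """Pick the layers capturing the MOST and the LEAST emotion/speaker
    information; all remaining layers stay frozen."""
    most = max(range(len(scores)), key=scores.__getitem__)
    least = min(range(len(scores)), key=scores.__getitem__)
    return {most, least}

# Toy example: 4 layers, (emotion, speaker) contribution pairs,
# equal weighting between the two characteristics (an assumption).
contribs = [(0.62, 0.58), (0.81, 0.90), (0.74, 0.70), (0.40, 0.45)]
scores = contribution_scores(contribs, weights=(0.5, 0.5))
trainable = select_layers_to_tune(scores)
print(sorted(trainable))  # → [1, 3]: layer 1 scores highest, layer 3 lowest
```

In a real fine-tuning loop, one would then set `requires_grad = False` on every layer outside `trainable`, which is how only ~8% of parameters end up being updated.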
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Speech | English emotional dataset | SS Score | 94.8 | 48 |
| Emotion and speaker adaptation | Chinese speech data (test) | SS (%) | 85.7 | 16 |
| Emotional Text-to-Speech | ESD (English) | SMOS | 4.35 | 16 |