FNH-TTS: Mixture-of-Experts Duration Modeling for Robust Neural Speech Synthesis

About

Current non-autoregressive (NAR) text-to-speech (TTS) systems still struggle to model diverse and speaker-dependent duration variation. We further observe that richer duration variation can increase the synthesis difficulty of existing HiFi-GAN-based vocoders, leading to spectral artifacts and unstable time-frequency structures. To address these issues, we propose FNH-TTS, a VITS-based end-to-end TTS system with Mixture-of-Experts duration modeling and robust vocoder-side synthesis. Specifically, we introduce a Mixture-of-Experts Duration Predictor (MoE-DP) to capture diverse phoneme duration patterns and speaker-dependent speaking-rate characteristics. To convert richer duration variation into stable waveform generation, we further integrate a VOCOS-style vocoder with Collaborative Multi-Band and Sub-Band Discriminators. Experiments on LJSpeech, VCTK, and LibriTTS show that FNH-TTS achieves improved synthesis quality, duration-category accuracy, vocoder reconstruction quality, and inference efficiency. Further analysis shows that MoE-DP is the main source of improved duration modeling, while stronger vocoder-side components are necessary for robust synthesis under richer duration variation.
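The abstract does not spell out how MoE-DP is built, so the following is only a minimal sketch of the general mixture-of-experts idea applied to duration prediction, not the authors' architecture. It assumes dense soft gating, linear experts, log-domain duration outputs, and a per-phoneme feature vector that already combines phoneme and speaker information. The class name `MoEDurationSketch` and all parameters are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dot product of weights and x, plus bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


class MoEDurationSketch:
    """Hypothetical soft-gated mixture of linear duration experts.

    Assumption: each input x is one phoneme's feature vector
    (e.g. phoneme and speaker embeddings concatenated upstream).
    """

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)

        def init():
            return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

        # One (weights, bias) pair per expert for the gate and for the expert itself.
        self.gate = [(init(), 0.0) for _ in range(num_experts)]
        self.experts = [(init(), 0.0) for _ in range(num_experts)]

    def gate_weights(self, x):
        """Mixing weights over experts; they are non-negative and sum to 1."""
        return softmax([linear(w, b, x) for w, b in self.gate])

    def predict(self, x):
        """Predicted log-duration: gate-weighted sum of expert outputs."""
        g = self.gate_weights(x)
        outs = [linear(w, b, x) for w, b in self.experts]
        return sum(gi * oi for gi, oi in zip(g, outs))

    def predict_frames(self, features):
        """Convert log-durations to integer frame counts, at least 1 per phoneme."""
        return [max(1, round(math.exp(self.predict(x)))) for x in features]


if __name__ == "__main__":
    model = MoEDurationSketch(dim=4, num_experts=3)
    phonemes = [[0.1, -0.2, 0.3, 0.5], [1.0, 0.0, -1.0, 0.2]]
    print(model.gate_weights(phonemes[0]))
    print(model.predict_frames(phonemes))
```

In this sketch the gate lets different experts dominate for different inputs, which is one way a single predictor could cover several speaking-rate patterns. A trained system would learn the gate and expert parameters rather than sample them randomly.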

Qingliang Meng, Yuqing Deng, Wei Liang, Limei Yu, Huizhi Liang, Tian Li • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Speech Synthesis	LJSpeech	MOS 4.48	11
Text-to-Speech Synthesis	VCTK	MOS 4.63	9
Speech Synthesis	Libri 460	Duration Accuracy 67.07	8

