FNH-TTS: Mixture-of-Experts Duration Modeling for Robust Neural Speech Synthesis
About
Current non-autoregressive (NAR) text-to-speech (TTS) systems still struggle to model diverse and speaker-dependent duration variation. We further observe that richer duration variation can increase the synthesis difficulty of existing HiFi-GAN-based vocoders, leading to spectral artifacts and unstable time-frequency structures. To address these issues, we propose FNH-TTS, a VITS-based end-to-end TTS system with Mixture-of-Experts duration modeling and robust vocoder-side synthesis. Specifically, we introduce a Mixture-of-Experts Duration Predictor (MoE-DP) to capture diverse phoneme duration patterns and speaker-dependent speaking-rate characteristics. To convert richer duration variation into stable waveform generation, we further integrate a VOCOS-style vocoder with Collaborative Multi-Band and Sub-Band Discriminators. Experiments on LJSpeech, VCTK, and LibriTTS show that FNH-TTS achieves improved synthesis quality, duration-category accuracy, vocoder reconstruction quality, and inference efficiency. Further analysis shows that MoE-DP is the main source of improved duration modeling, while stronger vocoder-side components are necessary for robust synthesis under richer duration variation.
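The abstract does not describe the internals of MoE-DP, so the following is only a minimal sketch of the general mixture-of-experts idea applied to duration prediction, not the authors' implementation. It assumes dense softmax gating, linear experts that predict log-durations, and an input vector `h` that stands in for a phoneme encoding combined with a speaker embedding. The class name `MoEDurationSketch` and all of its methods are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    """Dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


class MoEDurationSketch:
    """Hypothetical mixture-of-experts duration predictor (illustration only).

    Assumptions not taken from the paper: dense softmax gating, one linear
    expert per duration "style", and random (untrained) weights.
    """

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # One gating vector and one expert vector per expert.
        self.gate = [[rng.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(num_experts)]
        self.experts = [[rng.uniform(-1, 1) for _ in range(dim)]
                        for _ in range(num_experts)]

    def gate_weights(self, h):
        # The gate decides how much each expert contributes for this input.
        return softmax([dot(g, h) for g in self.gate])

    def predict_log_duration(self, h):
        # Weighted sum of the experts' log-duration predictions.
        weights = self.gate_weights(h)
        outputs = [dot(e, h) for e in self.experts]
        return sum(w * o for w, o in zip(weights, outputs))

    def predict_frames(self, h):
        # Convert log-duration to a frame count of at least one frame.
        return max(1, round(math.exp(self.predict_log_duration(h))))


# Usage: h is a made-up 4-dimensional phoneme-plus-speaker feature vector.
model = MoEDurationSketch(dim=4, num_experts=3)
h = [0.5, -0.2, 0.1, 0.9]
print(model.gate_weights(h), model.predict_frames(h))
```

Because the gate depends on the input, different speakers or phoneme contexts can route to different experts, which is one plausible way a model of this kind could capture speaker-dependent speaking-rate patterns.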
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Speech Synthesis | LJSpeech | MOS | 4.48 | 11 |
| Text-to-Speech Synthesis | VCTK | MOS | 4.63 | 9 |
| Speech Synthesis | Libri 460 | Duration Accuracy | 67.07 | 8 |