Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
About
Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
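The constant Fisher-Rao speed idea can be illustrated with a small numerical sketch. It is a toy example under stated assumptions, not the authors' implementation: it assumes a single Gibbs-style path p_beta(x) ∝ exp(-beta · d(x)) over a handful of tokens, where `dist` is a hypothetical vector of token-to-target distances and `beta_max` a hypothetical endpoint. For such an exponential family, the Fisher information with respect to beta is the variance of d under p_beta, so the Fisher-Rao speed is its square root. The schedule is obtained by integrating that speed on a dense grid and inverting the arc length. The function names are illustrative.

```python
import numpy as np

def gibbs_path(beta, dist):
    """Categorical distribution p_beta(x) proportional to exp(-beta * dist[x])."""
    logits = -beta * dist
    logits = logits - logits.max()  # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum()

def fisher_rao_speed(beta, dist):
    """sqrt of the Fisher information w.r.t. beta, i.e. the std of dist under p_beta."""
    p = gibbs_path(beta, dist)
    mean = np.sum(p * dist)
    var = np.sum(p * (dist - mean) ** 2)
    return np.sqrt(var)

def constant_speed_schedule(dist, beta_max, num_steps, grid_size=4096):
    """Return num_steps + 1 beta values that split the path into equal arc lengths."""
    betas = np.linspace(0.0, beta_max, grid_size)
    speed = np.array([fisher_rao_speed(b, dist) for b in betas])
    # Cumulative arc length by the trapezoidal rule.
    segments = 0.5 * (speed[1:] + speed[:-1]) * np.diff(betas)
    arc = np.concatenate([[0.0], np.cumsum(segments)])
    # Equal arc-length targets, mapped back to beta by inverse interpolation.
    targets = np.linspace(0.0, arc[-1], num_steps + 1)
    return np.interp(targets, arc, betas)

# Hypothetical distances from five tokens to a target token (assumed values).
dist = np.array([0.0, 0.5, 1.0, 2.0, 3.5])
schedule = constant_speed_schedule(dist, beta_max=5.0, num_steps=8)
print(schedule)
```

The resulting beta values are not evenly spaced: steps are smaller where the distribution changes quickly and larger where it changes slowly, and no scheduler hyperparameter is searched. The sketch handles one distance vector only; how distances and endpoints are defined across tokens in MI-DFM is not covered here.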
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Zero-shot Text-to-Speech | Seed-TTS en (test) | WER | 1.777 | 25 |
| Text-to-Speech | CosyVoice en 3 (test) | UTMOS | 3.238 | 10 |
| Text-to-Speech | CosyVoice zh 3 (test) | UTMOS | 2.438 | 10 |
| Subjective Speech Synthesis Evaluation | Seed-TTS and CosyVoice en 3 (test) | CMOS | 0.00 | 7 |
| Subjective Speech Synthesis Evaluation | Seed-TTS and CosyVoice zh 3 (test) | CMOS | 0.00 | 7 |