
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

About

Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
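To make the idea of traversing a path "at constant Fisher-Rao speed" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the scalar-parameterized path is a Gibbs distribution over tokens, p_beta(x) proportional to exp(-beta * d(x)), where d(x) is a token-to-target distance; this form, the function names, and the parameters `beta_max` and `grid_size` are all illustrative assumptions. For such an exponential family, the Fisher information with respect to beta equals the variance of d under p_beta, so the Fisher-Rao speed is its standard deviation. The sketch integrates that speed numerically and picks beta values equally spaced in arc length. It does not cover the moment correction.

```python
import math


def gibbs_probs(distances, beta):
    """Categorical p_beta(x) proportional to exp(-beta * d(x)) (assumed path form)."""
    m = min(distances)  # shift by the minimum for numerical stability
    weights = [math.exp(-beta * (d - m)) for d in distances]
    z = sum(weights)
    return [w / z for w in weights]


def fisher_rao_speed(distances, beta):
    """Speed along the path w.r.t. beta: sqrt(Fisher information) = Std_{p_beta}[d]."""
    p = gibbs_probs(distances, beta)
    mean = sum(pi * d for pi, d in zip(p, distances))
    var = sum(pi * (d - mean) ** 2 for pi, d in zip(p, distances))
    return math.sqrt(max(var, 0.0))


def constant_speed_schedule(distances, beta_max, num_steps, grid_size=2000):
    """Return beta_0..beta_num_steps, equally spaced in Fisher-Rao arc length.

    Hypothetical helper: arc length is integrated with the trapezoid rule on a
    uniform beta grid over [0, beta_max], then inverted by linear interpolation.
    """
    betas = [beta_max * i / grid_size for i in range(grid_size + 1)]
    speeds = [fisher_rao_speed(distances, b) for b in betas]

    # Cumulative arc length s(beta) on the grid.
    arc = [0.0]
    for i in range(1, len(betas)):
        step = betas[i] - betas[i - 1]
        arc.append(arc[-1] + 0.5 * (speeds[i] + speeds[i - 1]) * step)
    total = arc[-1]

    # Invert s(beta) at equally spaced arc-length targets.
    schedule = []
    j = 0
    for k in range(num_steps + 1):
        target = total * k / num_steps
        while j < grid_size - 1 and arc[j + 1] < target:
            j += 1
        seg = arc[j + 1] - arc[j]
        frac = 0.0 if seg <= 0.0 else (target - arc[j]) / seg
        frac = min(1.0, max(0.0, frac))
        schedule.append(betas[j] + frac * (betas[j + 1] - betas[j]))
    return schedule


if __name__ == "__main__":
    toy_distances = [0.0, 1.0, 2.0, 4.0]  # toy distances to the target token
    for b in constant_speed_schedule(toy_distances, beta_max=10.0, num_steps=8):
        print(round(b, 4))
```

In this toy setting the schedule takes small steps in beta near zero, where the distribution changes quickly, and large steps at high beta, where it has nearly collapsed onto the target token. No hyperparameter is tuned; the spacing follows from the path itself.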

Dong Yang, Yiyi Cai, Haoyu Zhang, Yuki Saito, Hiroshi Saruwatari • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Zero-shot Text-to-Speech | Seed-TTS en (test) | WER | 1.777 | 25 |
| Text-to-Speech | CosyVoice en 3 (test) | UTMOS | 3.238 | 10 |
| Text-to-Speech | CosyVoice zh 3 (test) | UTMOS | 2.438 | 10 |
| Subjective Speech Synthesis Evaluation | Seed-TTS and CosyVoice en 3 (test) | CMOS | 0.00 | 7 |
| Subjective Speech Synthesis Evaluation | Seed-TTS and CosyVoice zh 3 (test) | CMOS | 0.00 | 7 |
