Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

About

Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
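The idea of traversing a scalar-parameterized path at constant Fisher-Rao speed can be illustrated with a small numerical sketch. This is not the authors' implementation. It assumes a hypothetical Gibbs-type path p_beta(x) ∝ exp(-beta * dist[x]) over a toy vocabulary, where `dist`, `beta_max`, and both function names are made up for illustration. For such a one-parameter exponential family the Fisher information with respect to beta is the variance of `dist` under p_beta, so the schedule can be obtained by integrating the speed on a grid and inverting the cumulative arc length.

```python
import numpy as np

def fisher_rao_speed(beta, dist):
    """Speed of p_beta(x) ∝ exp(-beta * dist[x]) with respect to beta.

    Assumption: a Gibbs-type path. Its Fisher information in beta is
    Var_{p_beta}[dist], so the speed is the square root of that variance.
    """
    logits = -beta * dist
    p = np.exp(logits - logits.max())  # subtract the max for numerical stability
    p /= p.sum()
    mean = (p * dist).sum()
    return np.sqrt((p * (dist - mean) ** 2).sum())

def constant_speed_schedule(dist, beta_max, num_steps, grid_size=2001):
    """Return num_steps + 1 beta values equally spaced in Fisher-Rao arc length."""
    betas = np.linspace(0.0, beta_max, grid_size)
    speeds = np.array([fisher_rao_speed(b, dist) for b in betas])
    # Trapezoidal cumulative arc length along the beta grid.
    segments = 0.5 * (speeds[1:] + speeds[:-1]) * np.diff(betas)
    arc = np.concatenate([[0.0], np.cumsum(segments)])
    # Invert the arc length: pick beta(t) so that arc(beta(t)) = t * total length.
    targets = np.linspace(0.0, arc[-1], num_steps + 1)
    return np.interp(targets, arc, betas)

# Toy distances from each token to a target token (illustrative values only).
dist = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
schedule = constant_speed_schedule(dist, beta_max=10.0, num_steps=8)
print(schedule)
```

The resulting schedule needs no training and no hand-tuned shape parameter: its steps are small in beta where the distribution changes quickly and larger where it changes slowly, which is the behavior a constant-speed traversal is meant to give.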

Dong Yang, Yiyi Cai, Haoyu Zhang, Yuki Saito, Hiroshi Saruwatari • 2026

Related benchmarks

Task	Dataset	Result
Zero-shot Text-to-Speech	Seed-TTS en (test)	WER 1.777	25
Text-to-Speech	CosyVoice en 3 (test)	UTMOS 3.238	10
Text-to-Speech	CosyVoice zh 3 (test)	UTMOS 2.438	10
Subjective Speech Synthesis Evaluation	Seed-TTS and CosyVoice en 3 (test)	CMOS 0.00	7
Subjective Speech Synthesis Evaluation	Seed-TTS and CosyVoice zh 3 (test)	CMOS 0.00	7
