
WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching

About

Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transportation costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without significantly degrading sample quality, we introduce a tailored consistency distillation method for WaveFM. Experimental results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.
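The abstract mentions a refined multi-resolution STFT loss as one of the auxiliary objectives. The paper's exact refinement is not given here, but the standard multi-resolution STFT loss it builds on compares spectral-convergence and log-magnitude terms across several FFT/hop/window configurations. Below is a minimal NumPy sketch of that baseline form; the function names and resolution triples are illustrative, not taken from WaveFM's implementation.

```python
import numpy as np

def stft_mag(x, fft_size, hop, win_len):
    """Magnitude STFT via Hann-windowed framing and a real FFT (NumPy only)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=-1))

def stft_loss(x, y, fft_size, hop, win_len, eps=1e-7):
    """One resolution: spectral convergence + mean log-magnitude distance."""
    X = stft_mag(x, fft_size, hop, win_len)
    Y = stft_mag(y, fft_size, hop, win_len)
    sc = np.linalg.norm(Y - X) / (np.linalg.norm(Y) + eps)
    mag = np.mean(np.abs(np.log(Y + eps) - np.log(X + eps)))
    return sc + mag

def multi_resolution_stft_loss(x, y,
                               resolutions=((512, 128, 512),
                                            (1024, 256, 1024),
                                            (2048, 512, 2048))):
    """Average the single-resolution loss over several (fft, hop, win) triples."""
    return sum(stft_loss(x, y, *r) for r in resolutions) / len(resolutions)
```

Averaging over multiple resolutions trades off time and frequency localization: short windows penalize transient errors, long windows penalize harmonic errors, so no single STFT configuration dominates the objective.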

Tianze Luo, Xingchen Miao, Wenbo Duan • 2025

Related benchmarks

Task                      | Dataset                      | Result    | Rank
Speech Synthesis          | LibriTTS (test)              | -         | 17
Text-to-Speech            | LibriSpeech clean PC (test)  | -         | 17
Zero-shot Text-to-Speech  | LibriSpeech PC clean (test)  | WER 2.01  | 12
