Drax: Speech Recognition with Discrete Flow Matching

About

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than through direct random-noise-to-target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, which are controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.

Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya • 2025
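
The abstract's idea of a probability path that passes through error-like intermediate states, rather than jumping straight from noise to the target, can be illustrated with a toy sampler. This is only a minimal sketch under stated assumptions, not the authors' implementation: the three-way mixture, the quadratic weight schedule, the `MASK` token id, and the function names are all hypothetical, and the `intermediate` sequence merely stands in for an audio-conditioned hypothesis that contains plausible recognition errors.

```python
import random

MASK = -1  # hypothetical id for the source ("noise") token


def mixture_weights(t):
    """Return (source, intermediate, target) weights at time t in [0, 1].

    Assumed schedule: all mass on the source at t=0, all mass on the
    target at t=1, with the intermediate weight peaking at t=0.5.
    The three weights always sum to 1.
    """
    return (1.0 - t) ** 2, 2.0 * t * (1.0 - t), t ** 2


def sample_path_state(target, intermediate, t, rng):
    """Sample a training-time state x_t, one token position at a time.

    target:       ground-truth token ids
    intermediate: token ids of an error-prone hypothesis (a stand-in for
                  an audio-conditioned distribution), same length as target
    """
    w_src, w_mid, _ = mixture_weights(t)
    state = []
    for tgt_tok, mid_tok in zip(target, intermediate):
        u = rng.random()
        if u < w_src:
            state.append(MASK)       # still noise at this position
        elif u < w_src + w_mid:
            state.append(mid_tok)    # plausible intermediate error
        else:
            state.append(tgt_tok)    # already correct
    return state


if __name__ == "__main__":
    rng = random.Random(0)
    target = [11, 12, 13, 14]
    intermediate = [11, 99, 13, 14]  # one substituted token
    for t in (0.0, 0.5, 1.0):
        print(t, sample_path_state(target, intermediate, t, rng))
```

With this assumed schedule, states sampled at mid-range `t` mix masked, erroneous, and correct tokens, which is the kind of partially wrong sequence a parallel decoder would have to refine at inference time.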

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech (test-other)	WER 5.7	1447
Automatic Speech Recognition	LibriSpeech clean (test)	WER 2.6	1410
Automatic Speech Recognition	AMI	WER 13.9	46
Automatic Speech Recognition	VoxPopuli	WER 8.6	44
Automatic Speech Recognition	Earnings-22	WER 14.09	39
Automatic Speech Recognition	LibriSpeech (LS) clean	WER 2.4	21
Automatic Speech Recognition	CV-zh	CER 16.68	15
Automatic Speech Recognition	MLS FR (test)	WER 7.1	13
Automatic Speech Recognition	AISHELL	CER 7.13	12
Automatic Speech Recognition	VoxPopuli English	WER 7.07	10

Showing 10 of 21 rows
