MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow

About

Target speaker extraction (TSE) aims to isolate a desired speaker's voice from a multi-speaker mixture using auxiliary information such as a reference utterance. Although recent advances in diffusion and flow-matching models have improved TSE performance, these methods typically require multi-step sampling, which limits their practicality in low-latency settings. In this work, we propose MeanFlow-TSE, a one-step generative TSE framework trained with mean-flow objectives, enabling fast and high-quality generation without iterative refinement. Building on the AD-FlowTSE paradigm, our method defines a flow between the background and target source that is governed by the mixing ratio (MR). Experiments on the Libri2Mix corpus show that our approach outperforms existing diffusion- and flow-matching-based TSE models in separation quality and perceptual metrics while requiring only a single inference step. These results demonstrate that mean-flow-guided one-step generation offers an effective and efficient alternative for real-time target speaker extraction. Code is available at https://github.com/rikishimizu/MeanFlow-TSE.

Riki Shimizu, Xilin Jiang, Nima Mesgarani • 2025
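
The abstract contrasts multi-step sampling with a single jump driven by an average ("mean") velocity. The toy sketch below illustrates only that idea and is not the authors' implementation. It makes three assumptions: the path between background and target is a straight line, the mixture sits on that path at a position given by the mixing ratio, and an oracle stands in for the trained network. All names in it (`mix`, `oracle_average_velocity`, `one_step`, `euler`, `mr`) are hypothetical.

```python
# Toy illustration only: an oracle replaces the trained network, and the
# straight-line path is an assumption made for this sketch.

def mix(background, target, t):
    """Point at time t on the assumed straight path: background at t=0, target at t=1."""
    return [(1.0 - t) * b + t * s for b, s in zip(background, target)]


def oracle_average_velocity(background, target):
    """On a straight path the average velocity is constant: target - background."""
    return [s - b for b, s in zip(background, target)]


def one_step(x_t, t, u, r=1.0):
    """Single jump from time t to time r using the average velocity u over [t, r]."""
    return [x + (r - t) * ui for x, ui in zip(x_t, u)]


def euler(x_t, t, v, steps, r=1.0):
    """Multi-step Euler integration with instantaneous velocity v, for comparison."""
    dt = (r - t) / steps
    x = list(x_t)
    for _ in range(steps):
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


background = [0.2, -0.1, 0.4]   # made-up background samples
target = [1.0, 0.5, -0.3]       # made-up target samples
mr = 0.6                        # assumed position of the mixture on the path

mixture = mix(background, target, mr)
u = oracle_average_velocity(background, target)

est_one = one_step(mixture, mr, u)            # one function evaluation
est_multi = euler(mixture, mr, u, steps=50)   # fifty function evaluations
```

In this toy setting both estimates recover the target up to floating-point error, but the one-step version needs a single velocity evaluation. With a learned model the instantaneous velocity is generally not constant, which is why a network trained to predict the average velocity over the whole interval is what makes the single jump viable.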

Related benchmarks

Task	Dataset	Result
Multi-talker Automatic Speech Recognition	Libri2Mix Clean (test)	WER 9.05	21
Target Speaker Extraction	Libri2Mix Clean (test)	--	20
Target Speaker Extraction	Libri2Mix Clean	DNSMOS OVL 3.69	14
Target Speaker Extraction	Libri2Mix Clean min 16 kHz	PESQ 3.26	9
Target Speaker Extraction	Libri2Mix Noisy min 16 kHz	PESQ 2.21	8
Target Speaker Extraction	Libri2Mix noisy	PESQ 2.21	7
Target Speaker Extraction	REAL-T AMI English	DNSMOS OVRL 2.178	6
Target Speaker Extraction	REAL-T CHiME-6 English subset	DNSMOS OVRL 1.896	6
Target Speaker Extraction	REAL-T AISHELL-4 Chinese	DNSMOS OVRL 2.258	6
Target Speaker Extraction	REAL-T AliMeeting Chinese	DNSMOS OVRL 2.058	6

Showing 10 of 13 rows
