MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow
About
Target speaker extraction (TSE) aims to isolate a desired speaker's voice from a multi-speaker mixture using auxiliary information such as a reference utterance. Although recent advances in diffusion and flow-matching models have improved TSE performance, these methods typically require multi-step sampling, which limits their practicality in low-latency settings. In this work, we propose MeanFlow-TSE, a one-step generative TSE framework trained with mean-flow objectives, enabling fast and high-quality generation without iterative refinement. Building on the AD-FlowTSE paradigm, our method defines a flow between the background and target source that is governed by the mixing ratio (MR). Experiments on the Libri2Mix corpus show that our approach outperforms existing diffusion- and flow-matching-based TSE models in separation quality and perceptual metrics while requiring only a single inference step. These results demonstrate that mean-flow-guided one-step generation offers an effective and efficient alternative for real-time target speaker extraction. Code is available at https://github.com/rikishimizu/MeanFlow-TSE.
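The one-step idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes a straight-line path from the background (time 0) to the target source (time 1), with the mixture sitting on that path at a time equal to the mixing ratio. The names `one_step_extract`, `mean_velocity_fn`, and `oracle_mean_velocity` are hypothetical, and the oracle stands in for a trained network that would be conditioned on the reference utterance. The paper's actual time convention and parameterization may differ.

```python
import numpy as np


def one_step_extract(mixture, mixing_ratio, mean_velocity_fn):
    """Jump from the mixture (time t = mixing_ratio) to the target (time r = 1).

    A mean-flow model predicts the average velocity over [t, r], so a single
    update covers the whole interval instead of many small ODE steps.
    """
    t, r = mixing_ratio, 1.0
    u = mean_velocity_fn(mixture, t, r)  # average velocity over [t, r]
    return mixture + (r - t) * u


# Toy signals standing in for one second of 16 kHz audio (assumption).
rng = np.random.default_rng(0)
background = rng.standard_normal(16000)
target = rng.standard_normal(16000)

# Assumed linear path: x(tau) = (1 - tau) * background + tau * target,
# with the observed mixture located at tau = mixing_ratio.
mixing_ratio = 0.6
mixture = (1.0 - mixing_ratio) * background + mixing_ratio * target


def oracle_mean_velocity(x, t, r):
    """Hypothetical stand-in for a trained network.

    On a straight path the velocity is constant, so the average velocity
    over any interval equals target - background.
    """
    return target - background


estimate = one_step_extract(mixture, mixing_ratio, oracle_mean_velocity)
print(np.allclose(estimate, target))
```

With the oracle velocity, the single step lands exactly on the target. A learned model only approximates this average velocity, so its one-step estimate would carry some error.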
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Target Speaker Extraction | Libri2Mix Clean | DNSMOS OVL | 3.69 | 14 |
| Target Speaker Extraction | Libri2Mix Clean min 16 kHz | PESQ | 3.26 | 9 |
| Target Speaker Extraction | Libri2Mix Noisy min 16 kHz | PESQ | 2.21 | 8 |
| Target Speaker Extraction | Libri2Mix noisy | PESQ | 2.21 | 7 |
| Target Speaker Extraction | REAL-T AMI English | DNSMOS OVRL | 2.178 | 6 |
| Target Speaker Extraction | REAL-T CHiME-6 English subset | DNSMOS OVRL | 1.896 | 6 |
| Target Speaker Extraction | REAL-T AISHELL-4 Chinese | DNSMOS OVRL | 2.258 | 6 |
| Target Speaker Extraction | REAL-T AliMeeting Chinese | DNSMOS OVRL | 2.058 | 6 |
| Target Speaker Extraction | REAL-T DipCo English | DNSMOS OVRL | 1.475 | 6 |