AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow

About

In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).
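
The one-step idea described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a straight-line path with the mixture at time 0 and the target at time 1, and it replaces the trained, enrollment-conditioned network with a hypothetical oracle (`oracle_mean_velocity`) that returns the exact mean velocity. The AlphaFlow training objective itself is not shown.

```python
import math

def interpolate(mixture, target, t):
    """Point on the assumed straight-line path: t=0 is the mixture, t=1 is the target."""
    return [(1.0 - t) * m + t * x for m, x in zip(mixture, target)]

def oracle_mean_velocity(mixture, target):
    """Hypothetical stand-in for a trained network u(z, r, t | enrollment).

    On a straight-line path the mean velocity over any interval [r, t]
    equals target - mixture, so the oracle simply returns that difference.
    """
    return [x - m for m, x in zip(mixture, target)]

def one_step_extract(mixture, mean_velocity, r=0.0, t=1.0):
    """Single update z_t = z_r + (t - r) * u, started at the observed mixture."""
    return [m + (t - r) * u for m, u in zip(mixture, mean_velocity)]

# Toy signals: a "target" sinusoid plus a weaker "interferer".
n = 8
target = [math.sin(2 * math.pi * k / n) for k in range(n)]
interferer = [0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]
mixture = [a + b for a, b in zip(target, interferer)]

u = oracle_mean_velocity(mixture, target)
estimate = one_step_extract(mixture, u)
max_err = max(abs(e - x) for e, x in zip(estimate, target))
```

With an exact mean velocity, a single step from the mixture lands on the target, so no mixing-ratio or starting-time estimate is needed in this toy setting. A learned model would only approximate `u`, and the estimate would carry that approximation error.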

Duojia Li, Shuhan Zhang, Zihan Qian, Wenxuan Wu, Shuai Wang, Qingyang Hong, Lin Li, Haizhou Li • 2026

Related benchmarks

Task	Dataset	Result
Target Speaker Extraction	Libri2Mix Clean min 16 kHz	PESQ 3.27	9
Target Speaker Extraction	Libri2Mix Noisy min 16 kHz	PESQ 2.28	8
Target Speaker Extraction	REAL-T DipCo English	DNSMOS OVRL 1.56	6
Target Speaker Extraction	REAL-T AISHELL-4 Chinese	DNSMOS OVRL 2.277	6
Target Speaker Extraction	REAL-T AliMeeting Chinese	DNSMOS OVRL 2.086	6
Target Speaker Extraction	REAL-T AMI English	DNSMOS OVRL 2.169	6
Target Speaker Extraction	REAL-T CHiME-6 English subset	DNSMOS OVRL 1.858	6

