
AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow

About

In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).
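As a rough illustration of what one-step mean-velocity transport from the mixture could look like, the sketch below moves the observed mixture to a target estimate with a single update. It is a toy under stated assumptions, not the authors' implementation: the names `one_step_extract` and `oracle_mean_velocity`, the time interval [0, 1], the sign convention of the update, and the straight-line trajectory are all hypothetical, and the oracle stands in for a trained network conditioned on the enrollment utterance.

```python
import random


def one_step_extract(mixture, enrollment, mean_velocity, t_start=0.0, t_end=1.0):
    """Transport the mixture to a target estimate in one step.

    Assumed update rule: x(t_end) = x(t_start) + (t_end - t_start) * u,
    where u is the mean velocity over the interval [t_start, t_end].
    """
    u = mean_velocity(mixture, enrollment, t_start, t_end)
    span = t_end - t_start
    return [x + span * v for x, v in zip(mixture, u)]


# Toy signals (hypothetical data, a few samples instead of real audio).
random.seed(0)
target = [random.gauss(0.0, 1.0) for _ in range(8)]
interferer = [random.gauss(0.0, 1.0) for _ in range(8)]
mixture = [a + b for a, b in zip(target, interferer)]
enrollment = [random.gauss(0.0, 1.0) for _ in range(4)]


def oracle_mean_velocity(x, enrollment, t_start, t_end):
    # Stand-in for a learned model. On an assumed straight path from
    # mixture to target, the velocity is constant and equals
    # (target - mixture), so the mean velocity over any interval is the same.
    # The enrollment argument is ignored here; a real model would use it.
    return [tgt - mix for tgt, mix in zip(target, mixture)]


estimate = one_step_extract(mixture, enrollment, oracle_mean_velocity)
```

With the oracle velocity the single step lands on the target up to floating-point error, which is the behavior a well-trained mean-velocity model would approximate; no auxiliary mixing-ratio estimate is needed because the step starts at the mixture itself.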

Duojia Li, Shuhan Zhang, Zihan Qian, Wenxuan Wu, Shuai Wang, Qingyang Hong, Lin Li, Haizhou Li • 2026

Related benchmarks

Task | Dataset | Result | Rank
Target Speaker Extraction | Libri2Mix Clean min 16 kHz | PESQ 3.27 | 9
Target Speaker Extraction | Libri2Mix Noisy min 16 kHz | PESQ 2.28 | 8
Target Speaker Extraction | REAL-T DipCo English | DNSMOS OVRL 1.56 | 6
Target Speaker Extraction | REAL-T AISHELL-4 Chinese | DNSMOS OVRL 2.277 | 6
Target Speaker Extraction | REAL-T AliMeeting Chinese | DNSMOS OVRL 2.086 | 6
Target Speaker Extraction | REAL-T AMI English | DNSMOS OVRL 2.169 | 6
Target Speaker Extraction | REAL-T CHiME-6 English subset | DNSMOS OVRL 1.858 | 6