
EvoTSE: Evolving Enrollment for Target Speaker Extraction

About

Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where the model mistakenly extracts the interfering speaker. Furthermore, conventional TSE relies on a static inference pipeline, whose performance is limited by the quality of the fixed enrollment. To overcome these limitations, we propose EvoTSE, an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for the pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially in out-of-domain (OOD) scenarios. Our code and checkpoints are available.

Zikai Liu, Ziqian Wang, Xingchen Li, Yike Zhu, Shuai Wang, Longshuai Xiao, Lei Xie • 2026
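
The evolving-enrollment idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the class name EnrollmentMemory, the use of speaker embeddings as plain vectors, the scalar confidence score, the threshold value, retrieval by cosine similarity to the initial enrollment, and fusion by simple averaging are all hypothetical choices. The abstract only states that the enrollment is updated through reliability-filtered retrieval over high-confidence historical estimates.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)


class EnrollmentMemory:
    """Hypothetical memory of past speaker embeddings (illustrative only)."""

    def __init__(self, initial_embedding, confidence_threshold=0.8, top_k=3):
        self.initial = list(initial_embedding)  # embedding of the pre-recorded enrollment
        self.confidence_threshold = confidence_threshold  # assumed reliability cutoff
        self.top_k = top_k  # assumed number of retrieved estimates
        self.bank = []  # embeddings of accepted historical estimates

    def add(self, embedding, confidence):
        """Reliability filter: keep an estimate only if its confidence is high enough."""
        if confidence >= self.confidence_threshold:
            self.bank.append(list(embedding))
            return True
        return False

    def current_enrollment(self):
        """Retrieve the stored estimates closest to the initial enrollment and fuse them."""
        if not self.bank:
            return list(self.initial)
        ranked = sorted(
            self.bank,
            key=lambda e: cosine_similarity(self.initial, e),
            reverse=True,
        )
        pool = [self.initial] + ranked[: self.top_k]
        # Assumed fusion rule: element-wise mean of the initial and retrieved embeddings.
        return [sum(dim) / len(pool) for dim in zip(*pool)]


memory = EnrollmentMemory([1.0, 0.0], confidence_threshold=0.8, top_k=2)
rejected = memory.add([0.0, 1.0], confidence=0.5)  # below threshold, discarded
accepted = memory.add([1.0, 1.0], confidence=0.9)  # above threshold, stored
enrollment = memory.current_enrollment()
```

In this sketch a low-confidence estimate never enters the memory, so a segment where the model may have extracted the wrong speaker cannot contaminate later enrollments. How the actual system scores reliability and fuses retrieved estimates is not specified on this page.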

Related benchmarks

Task                        Dataset           Result                Rank
Target Speaker Extraction   Libri2Mix Clean   --                    14
Target Speaker Extraction   WSJ0-2Mix         SI-SDRi (dB): 23.44   8
Target Speaker Extraction   ESD (test)        SI-SDRi (dB): 16.67   8
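
For readers unfamiliar with the metric in the table, SI-SDRi is the improvement in scale-invariant signal-to-distortion ratio of the extracted signal over the unprocessed mixture, both measured against the clean target. The sketch below uses the standard SI-SDR definition (zero-mean signals, projection of the estimate onto the target) in pure Python for portability. The function names and the toy signals are illustrative assumptions and are not taken from the EvoTSE code; real evaluations would operate on full waveforms, typically with an array library.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length signals."""
    n = len(target)
    mean_est = sum(estimate) / n
    mean_tgt = sum(target) / n
    est = [x - mean_est for x in estimate]  # remove DC offset
    tgt = [x - mean_tgt for x in target]
    # Project the estimate onto the target to get the scaled target component.
    scale = sum(e * t for e, t in zip(est, tgt)) / (sum(t * t for t in tgt) + eps)
    projection = [scale * t for t in tgt]
    residual = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Toy example: the interferer is orthogonal to the target.
target = [1.0, -1.0, 1.0, -1.0]
interferer = [1.0, 1.0, -1.0, -1.0]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]  # interferer attenuated 10x

improvement = si_sdr_improvement(estimate, mixture, target)  # about 20 dB
```

In the toy example the mixture has equal target and interferer energy (0 dB SI-SDR), and the estimate leaves one hundredth of the interferer energy (20 dB SI-SDR), so the improvement is about 20 dB.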
