
Multi-Level Speaker Representation for Target Speaker Extraction

About

Target speaker extraction (TSE) relies on a reference cue of the target speaker to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such an embedding, pre-trained on a large number of speakers, may suffer from confusion of speaker identity. In this work, we propose a multi-level speaker representation approach, spanning raw features to neural embeddings, to serve as the speaker reference cue. We generate a spectral-level representation from the enrollment magnitude spectrogram as a raw, low-level feature, which significantly improves the model's generalization capability. Additionally, we propose a contextual embedding feature based on a cross-attention mechanism that integrates frame-level embeddings from a pre-trained speaker encoder. By incorporating speaker features across multiple levels, we significantly enhance the performance of the TSE model. Our approach achieves a 2.74 dB improvement and a 4.94% increase in extraction accuracy on the Libri2Mix test set over the baseline.
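To make the contextual embedding idea more concrete, the following is a minimal NumPy sketch of single-head cross-attention over frame-level enrollment embeddings. It is not the authors' implementation: the abstract does not say which side supplies the queries, so using mixture frames as queries and enrollment frames as keys and values is an assumption, as are the function name `contextual_embedding`, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions. Random arrays stand in for real features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contextual_embedding(mixture_feats, enroll_frames, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention.

    Assumption: mixture frames act as queries, and the frame-level
    enrollment embeddings act as keys and values.
    """
    q = mixture_feats @ w_q                    # (T_mix, d)
    k = enroll_frames @ w_k                    # (T_enr, d)
    v = enroll_frames @ w_v                    # (T_enr, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each row sums to 1 over enrollment frames
    return attn @ v, attn                      # one speaker cue per mixture frame

# Illustrative sizes only; none of these come from the paper.
rng = np.random.default_rng(0)
T_mix, T_enr, d_mix, d_spk, d = 50, 30, 64, 32, 16
mixture_feats = rng.standard_normal((T_mix, d_mix))
enroll_frames = rng.standard_normal((T_enr, d_spk))
w_q = rng.standard_normal((d_mix, d)) / np.sqrt(d_mix)
w_k = rng.standard_normal((d_spk, d)) / np.sqrt(d_spk)
w_v = rng.standard_normal((d_spk, d)) / np.sqrt(d_spk)

ctx, attn = contextual_embedding(mixture_feats, enroll_frames, w_q, w_k, w_v)
print(ctx.shape, attn.shape)
```

Under these assumptions, each mixture frame receives its own weighted summary of the enrollment frames, in contrast to a single utterance-level speaker embedding shared by all frames.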

Ke Zhang, Junjie Li, Shuai Wang, Yangjie Wei, Yi Wang, Yannan Wang, Haizhou Li • 2024

Related benchmarks

Task                       Dataset    Result               Rank
Target Speaker Extraction  Libri2Mix  SI-SNRi (dB): 16.08  4
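For readers unfamiliar with the metric, SI-SNRi is the scale-invariant signal-to-noise ratio of the extracted speech minus that of the unprocessed mixture, both measured against the clean target. The sketch below uses the commonly used definition of SI-SNR. The function names and the synthetic signals are illustrative assumptions, not the evaluation code behind the reported result.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals."""
    # Remove the mean so the measure ignores DC offset.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled target component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)

# Synthetic example: white noise stands in for speech signals.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = target + interferer
estimate = target + 0.1 * interferer   # a hypothetical extraction output with residual interference

improvement = si_snr_improvement(estimate, mixture, target)
print(f"SI-SNRi: {improvement:.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.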
