Wanna hear your voice? A sample is all we need!

About

Research on audio clue-based target speaker extraction (TSE) has focused on modeling mixtures and reference speech, achieving strong results in English due to abundant datasets. However, cross-lingual properties remain underexplored, as low-resource languages face challenges from limited annotated data and linguistic resources. To bridge this gap, we propose WHYV (Wanna Hear Your Voice), a cross-lingual TSE framework enabling zero-shot adaptation without fine-tuning. WHYV employs a frequency-modulated gating mechanism that dynamically adjusts the acoustic features of the target speaker, minimizing reliance on language-specific cues. Evaluations demonstrate state-of-the-art zero-shot performance: 13.8 dB (Libri2Mix mix-both), 18.1 dB (mix-clean), and 14.8 dB on Vietnamese data.

The Hieu Pham, Phuong Thanh Tran Nguyen, Xuan Tho Nguyen, Tan Dat Nguyen, Duc Dung Nguyen • 2024

Related benchmarks

Task	Dataset	Result
Target Speaker Extraction	Libri2Mix Noisy (test)	SI-SDR 13.3	5
Target Speaker Extraction	Vietnamese zero-shot	SI-SDR 14.6	5
Target Speaker Extraction	AISHELL zero-shot Clean	SI-SDR 13.4	5
Target Speaker Extraction	AISHELL Noisy zero-shot	SI-SDR 10.2	5
Target Speech Extraction	Libri2Mix noisy	SI-SDR 13.3	2
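
The results above are reported as SI-SDR (scale-invariant signal-to-distortion ratio), where higher is better. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition in plain Python. The function name `si_sdr` and the `eps` stabilizer are illustrative assumptions; this is not the authors' evaluation code, and their exact preprocessing may differ.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Illustrative sketch of the standard definition; `eps` is an assumed
    stabilizer that avoids division by zero and log of zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference: this is the optimally
    # scaled copy of the reference contained in the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything left over after removing the scaled target is distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)

    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Because the reference is rescaled by the projection coefficient before the residual is measured, multiplying the estimate by a constant gain leaves the score essentially unchanged. That property is what makes the metric "scale-invariant".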

