Wanna hear your voice? A sample is all we need!

About

Research on audio clue-based target speaker extraction (TSE) has focused on modeling mixtures and reference speech, achieving strong results in English due to abundant datasets. However, cross-lingual properties remain underexplored, as low-resource languages face challenges from limited annotated data and linguistic resources. To bridge this gap, we propose WHYV (Wanna Hear Your Voice), a cross-lingual TSE framework enabling zero-shot adaptation without fine-tuning. WHYV employs a frequency-modulated gating mechanism that dynamically adjusts the acoustic features of the target speaker, minimizing reliance on language-specific cues. Evaluations demonstrate state-of-the-art zero-shot performance: 13.8 dB (Libri2Mix mix-both), 18.1 dB (mix-clean), and 14.8 dB on Vietnamese data.

The Hieu Pham, Phuong Thanh Tran Nguyen, Xuan Tho Nguyen, Tan Dat Nguyen, Duc Dung Nguyen • 2024

Related benchmarks

Task                       Dataset                   Metric  Result (dB)  Rank
Target Speaker Extraction  Libri2Mix Noisy (test)    SI-SDR  13.3         5
Target Speaker Extraction  Vietnamese zero-shot      SI-SDR  14.6         5
Target Speaker Extraction  AISHELL zero-shot Clean   SI-SDR  13.4         5
Target Speaker Extraction  AISHELL Noisy zero-shot   SI-SDR  10.2         5
Target Speech Extraction   Libri2Mix noisy           SI-SDR  13.3         2
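
The benchmark results above are reported as SI-SDR (scale-invariant signal-to-distortion ratio). As a reading aid, here is a minimal sketch of the standard SI-SDR definition in plain Python. It is not the authors' evaluation code; the function name `si_sdr` and the `eps` stabilizer are assumptions made for illustration.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between an estimated and a reference signal.

    Hypothetical helper for illustration; follows the common definition:
    project the estimate onto the target, then compare the energy of the
    projection with the energy of the residual.
    """
    if len(estimate) != len(target) or len(target) == 0:
        raise ValueError("signals must be non-empty and of equal length")

    n = len(target)
    # Remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]

    # Optimal scaling of the target that best explains the estimate.
    dot_et = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt)
    alpha = dot_et / (tgt_energy + eps)

    # Split the estimate into a target-aligned part and a residual.
    projection = [alpha * t for t in tgt]
    residual = [e - p for e, p in zip(est, projection)]

    proj_energy = sum(p * p for p in projection)
    res_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((proj_energy + eps) / (res_energy + eps))
```

Because the target is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score unchanged, which is why the metric is preferred over plain SNR for separation and extraction systems whose output level is arbitrary.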
