ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody
About
Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits such as prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework that integrates fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder in which one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show that ProKWS delivers highly competitive performance, comparable to state-of-the-art models on standard benchmarks, and demonstrates strong robustness for personalized keywords with tone and intent variations.
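As a rough illustration of how a fusion module might "dynamically combine" the two streams, the sketch below blends a phonemic embedding and a prosodic embedding with a per-dimension sigmoid gate. This is an assumed, generic gated-fusion form, not the fusion module described in the paper; the function name `gated_fusion` and the parameters `w_phoneme`, `w_prosody`, and `bias` are hypothetical and would be learned in a real model.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(phoneme, prosody, w_phoneme, w_prosody, bias):
    """Blend two same-length embeddings with a per-dimension gate.

    Assumed form (hypothetical, not taken from the paper):
        gate  = sigmoid(w_phoneme * phoneme + w_prosody * prosody + bias)
        fused = gate * phoneme + (1 - gate) * prosody
    """
    lengths = {len(phoneme), len(prosody), len(w_phoneme), len(w_prosody), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    fused = []
    for p, s, wp, ws, b in zip(phoneme, prosody, w_phoneme, w_prosody, bias):
        gate = sigmoid(wp * p + ws * s + b)  # how much to trust the phoneme stream
        fused.append(gate * p + (1.0 - gate) * s)
    return fused


# With zero weights and bias the gate is 0.5, so the output is a plain average.
zeros = [0.0, 0.0]
print(gated_fusion([1.0, 2.0], [3.0, 0.0], zeros, zeros, zeros))  # [2.0, 1.0]
```

A gate near 1 passes the phonemic stream through almost unchanged, while a gate near 0 favors the prosodic stream, which is one simple way the balance could shift from one input to the next.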
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Keyword Spotting | LibriPhrase Easy (LPE) | EER | 0.63 | 25 |
| Keyword Spotting | LibriPhrase Hard (LPH) | EER | 0.0752 | 20 |
| Keyword Spotting | Wenet-Phrase (WPE) | AUC | 99.81 | 2 |
| Keyword Spotting | Accent-KWS (AC) | AUC | 71.45 | 2 |
| Keyword Spotting | Intent-KWS (IT) | AUC | 86.42 | 2 |
| Keyword Spotting | Wenet-Phrase (WPH) | AUC | 84.82 | 2 |
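
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how equal error rate (EER) and area under the ROC curve (AUC) can be computed from detection scores. It is a generic reference implementation, not the paper's evaluation code; the function names and the toy scores are made up for illustration. Both functions return fractions in [0, 1], whereas the AUC figures above appear to be on a percentage scale.

```python
def split_by_label(scores, labels):
    """Separate scores into positives (label 1) and negatives (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    return pos, neg


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos, neg = split_by_label(scores, labels)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def eer(scores, labels):
    """Error rate at the threshold where false accepts and false rejects are closest."""
    pos, neg = split_by_label(scores, labels)
    best_gap, best_eer = float("inf"), None
    for threshold in sorted(set(scores)):
        frr = sum(p < threshold for p in pos) / len(pos)   # keywords missed
        far = sum(n >= threshold for n in neg) / len(neg)  # non-keywords accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Toy example: three keyword trials (label 1) and three non-keyword trials (label 0).
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auc(toy_scores, toy_labels))
print(eer(toy_scores, toy_labels))
```

Lower EER is better and higher AUC is better, so the two metric columns should be read in opposite directions.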