ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody

About

Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits such as prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework that integrates fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder in which one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show that ProKWS delivers highly competitive performance, comparable to state-of-the-art models on standard benchmarks, and demonstrates strong robustness for personalized keywords with tone and intent variations.

Jianan Pan, Yuanming Zhang, Kejie Huang • 2026
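
To make the idea of dynamically combining a phonemic stream and a prosodic stream more concrete, here is a minimal sketch of one common fusion pattern: a learned scalar gate that blends two embeddings. This is an illustrative assumption, not the ProKWS fusion module, which the abstract does not specify. The function name `gated_fusion` and the parameters `gate_weights` and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(phoneme_vec, prosody_vec, gate_weights, gate_bias=0.0):
    """Blend two equal-length embeddings with a scalar gate (hypothetical sketch).

    The gate g is a sigmoid of a linear function of the concatenated inputs,
    so the mix between the two streams depends on the inputs themselves.
    Returns (fused_vector, g), where fused = g * phoneme + (1 - g) * prosody.
    """
    if len(phoneme_vec) != len(prosody_vec):
        raise ValueError("both embeddings must have the same length")
    concat = list(phoneme_vec) + list(prosody_vec)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")

    # Input-dependent gate: one scalar in (0, 1) per utterance.
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)

    # Convex combination of the two streams.
    fused = [g * p + (1.0 - g) * q for p, q in zip(phoneme_vec, prosody_vec)]
    return fused, g


# With all-zero gate weights the gate is exactly 0.5, i.e. a plain average.
fused, g = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
print(fused, g)
```

In a trained system the gate parameters would be learned jointly with both encoders; a per-dimension gate or an attention-based fusion would be equally plausible alternatives.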

Related benchmarks

Task	Dataset	Result
Keyword Spotting	LibriPhrase Easy (LPE)	EER 0.63	51
Keyword Spotting	LibriPhrase Hard (LPH)	EER 0.0752	25
Keyword Spotting	Wenet-Phrase (WPE)	AUC 99.81	2
Keyword Spotting	Accent-KWS (AC)	AUC 71.45	2
Keyword Spotting	Intent-KWS (IT)	AUC 86.42	2
Keyword Spotting	Wenet-Phrase (WPH)	AUC 84.82	2
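
For readers unfamiliar with the EER metric in the table, the following sketch shows one simple way to estimate an equal error rate from detection scores. It assumes higher scores mean "keyword present" and labels of 1 (target) and 0 (non-target). The function name `equal_error_rate` and the sample data are illustrative, not taken from the paper, and the page does not state whether its EER values are fractions or percentages.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping a decision threshold over the observed scores.

    At each threshold t, a sample is accepted when score >= t.
    FRR = fraction of targets rejected, FAR = fraction of non-targets accepted.
    The EER is taken at the threshold where FAR and FRR are closest,
    and reported as their average (a fraction between 0 and 1).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one target and one non-target sample")

    best_gap, eer = float("inf"), None
    # Include +inf so the "reject everything" operating point is also checked.
    for t in sorted(set(scores)) + [float("inf")]:
        frr = sum(s < t for s in pos) / len(pos)
        far = sum(s >= t for s in neg) / len(neg)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy data: one target scores below one non-target, so one of three
# targets and one of three non-targets are misclassified at the best threshold.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))
```

Evaluation toolkits usually interpolate between neighbouring thresholds rather than picking the closest one, so values on small test sets can differ slightly from this estimate.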
