PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords

About

This study presents a novel zero-shot user-defined keyword spotting model that exploits the audio-phoneme relationship of the keyword to improve performance. Unlike previous approaches, which make predictions only at the utterance level, we use both utterance-level and phoneme-level information. The proposed method comprises a two-stream speech encoder architecture, a self-attention-based pattern extractor, and a phoneme-level detection loss, and is designed to perform well across varied pronunciation conditions. In our experiments, the proposed model outperforms the baseline model and achieves competitive performance compared with full-shot keyword spotting models. It significantly improves the EER and AUC across all datasets, including familiar words, proper nouns, and indistinguishable pronunciations, with average relative improvements of 67% and 80%, respectively. The implementation code of our proposed model is available at https://github.com/ncsoft/PhonMatchNet.
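The abstract and the benchmark table report results as EER (equal error rate) and AUC (area under the ROC curve). As a reading aid, here is a minimal sketch of how these two metrics can be computed from per-trial match scores and binary labels. This is not taken from the PhonMatchNet repository; the function names `auc` and `eer` and the toy scores are assumptions made for illustration only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive trial scores
    higher than a randomly chosen negative trial (ties count as half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def eer(scores, labels):
    """Equal error rate, approximated on the observed score thresholds.

    A trial is accepted when score >= threshold. The function returns the
    mean of the false acceptance and false rejection rates at the threshold
    where the two are closest.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, best_eer = float("inf"), None
    for thr in sorted(set(scores)) + [float("inf")]:
        far = sum(n >= thr for n in neg) / len(neg)  # false acceptance rate
        frr = sum(p < thr for p in pos) / len(pos)   # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy data (hypothetical): 1 = audio matches the keyword, 0 = it does not.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print("AUC:", auc(toy_scores, toy_labels))
print("EER:", eer(toy_scores, toy_labels))
```

Lower EER and higher AUC both indicate better separation between matching and non-matching audio-keyword pairs. The pairwise AUC above is quadratic in the number of trials, which is fine for a small illustration; a rank-based or library implementation would be the usual choice for full evaluation sets.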

Yong-Hyeok Lee, Namhyun Cho • 2023

Related benchmarks

Task	Dataset	Result
Keyword Spotting	Google Speech Commands (test)	Accuracy 96.8	71
Keyword Spotting	LibriPhrase Easy (LPE)	EER 2.33	46
Speaker-Independent Keyword Spotting	LibriPhrase hard	AUROC 88.52	21
Speaker-Independent Keyword Spotting	Google Speech Commands (GSC)	AUROC 98.11	12
Open-vocabulary keyword spotting	LibriPhrase easy	EER 0.028	11
Speaker-Independent Keyword Spotting	Qualcomm Keyword Speech (Qcomm)	AUROC 98.9	10
Zero-shot Keyword Spotting	LibriPhrase Hard High phonetic confusion (train-other-500)	AUC 88.52	9
Zero-shot Keyword Spotting	LibriPhrase Easy (LPE) Low phonetic confusion other-500 (train)	AUC 99.29	9
Zero-shot Keyword Spotting	Google Speech Commands G V2	AUC 98.11	6
Zero-shot Keyword Spotting	Qualcomm Keyword Speech Q (evaluation)	AUC 98.9	6

