Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment

About

User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods suffer from high false alarm rates on confusable words and are limited to either audio-only or text-only enrollment. In this paper, we therefore first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it generalizes to jointly optimizing audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Based on this, a third-category discriminator is specifically designed to distinguish hard negatives. Overall, we develop a robust and flexible KWS system that supports different enrollment modalities within a unified framework. Evaluated on the LibriPhrase dataset, the proposed approach achieves state-of-the-art performance.
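
To make the idea of phoneme-level positive and negative comparisons concrete, here is a minimal sketch of an InfoNCE-style contrastive loss over phoneme embeddings. It is an illustration only, not the paper's formulation: the function name `phoneme_contrastive_loss`, the one-to-one row pairing between query and source, the cosine similarity, and the temperature value are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def phoneme_contrastive_loss(query, source, temperature=0.1):
    """InfoNCE-style loss over phoneme-level embeddings (illustrative sketch).

    query, source: arrays of shape (num_phonemes, dim). Row i of each is
    ASSUMED to represent the same phoneme (the positive pair); every other
    row of `source` acts as a negative for query row i. The pairing scheme
    and the temperature are assumptions, not taken from the paper.
    """
    # L2-normalize so the dot product is a cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = source / np.linalg.norm(source, axis=1, keepdims=True)

    # Pairwise similarities, scaled by the temperature.
    logits = q @ s.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_probs)))


# Hypothetical usage: 8 phonemes with 16-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))
aligned_loss = phoneme_contrastive_loss(embeddings, embeddings)
misaligned_loss = phoneme_contrastive_loss(embeddings, np.roll(embeddings, 1, axis=0))
```

In this sketch the same function could be applied to audio-text or audio-audio pairs simply by changing where `source` comes from, which is one plausible reading of how a single objective might cover both enrollment modes.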

Li Kewei, Zhou Hengshun, Shen Kai, Dai Yusheng, Du Jun • 2024

Related benchmarks

Task              Dataset                  Result      Rank
Keyword Spotting  LibriPhrase Easy (LPE)   EER 0.57    25
Keyword Spotting  LibriPhrase Hard (LPH)   EER 0.0847  20
Keyword Spotting  WSJ (test)               AP 0.7052   12