SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting

About

User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, a property that has been largely under-leveraged in prior work. Our analysis of the keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting). We further introduce a subsequence-level matching scheme to learn audio-text relations at a finer granularity, thus distinguishing similar-sounding keywords more effectively through enhanced context. In SLiCK, the model is trained with a multi-task learning approach using two modules: Matcher (utterance-level matching task and a novel subsequence-level matching task) and Encoder (phoneme recognition task). The proposed method improves on the baseline results on the LibriPhrase hard dataset, increasing AUC from 88.52 to 94.9 and reducing EER from 18.82 to 11.1.
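The two metrics quoted above, AUC and EER, can be illustrated with a minimal sketch. It assumes each audio-text pair receives a single match score and a binary label; the function name `auc_and_eer` and the toy scores are hypothetical and do not come from the paper, whose evaluation code may differ.

```python
def auc_and_eer(scores, labels):
    """Return (AUC, EER) as fractions in [0, 1].

    Assumption: label 1 means the audio matches the keyword text, 0 means it does not,
    and a higher score means a more confident match.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")

    # AUC: probability that a random positive outscores a random negative (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    # EER: sweep candidate thresholds and keep the one where the
    # false-accept rate and false-reject rate are closest to each other.
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        far = sum(n >= t for n in neg) / len(neg)  # negatives accepted at threshold t
        frr = sum(p < t for p in pos) / len(pos)   # positives rejected at threshold t
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return auc, eer


# Toy example with made-up scores (not data from the paper).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
auc, eer = auc_and_eer(scores, labels)
print(f"AUC = {100 * auc:.2f}, EER = {100 * eer:.2f}")
```

The abstract's figures appear to be on a 0 to 100 scale, so the sketch multiplies by 100 when printing. A higher AUC and a lower EER both indicate better separation of matching and non-matching pairs.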

Kumari Nishu, Minsik Cho, Devang Naik • 2024

Related benchmarks

Task	Dataset	Result
Keyword Spotting	LibriPhrase Easy (LPE)	EER 1.78	46
Speaker-Independent Keyword Spotting	LibriPhrase hard	AUROC 94.9	21
Open-vocabulary keyword spotting	LibriPhrase easy	EER 0.0214	11
Open-vocabulary keyword spotting	Google Speech Commands (GSC)	EER 8	6
Open-vocabulary keyword spotting	LibriPhrase hard	EER 14.3	6
Open-vocabulary keyword spotting	POB-Spark	EER 29.23	6
Open-vocabulary keyword spotting	POB-LP	Accuracy 98.7	6
