SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting
About
User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, a property that prior work has largely under-leveraged. Our analysis of the keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting). We further introduce a subsequence-level matching scheme to learn audio-text relations at a finer granularity, thus distinguishing similar-sounding keywords more effectively through enhanced context. In SLiCK, the model is trained with a multi-task learning approach using two modules: a Matcher (utterance-level matching task and a novel subsequence-level matching task) and an Encoder (phoneme recognition task). The proposed method improves on the baseline results on the LibriPhrase hard dataset, increasing AUC from $88.52$ to $94.9$ and reducing EER from $18.82$ to $11.1$.
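For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how AUC and EER can be computed from keyword-match scores. It is not taken from the SLiCK implementation: the function names `auc` and `eer`, the toy scores, and the convention that a higher score means "keyword matches" are all assumptions made for illustration. Both functions return fractions in $[0, 1]$; multiplying by 100 gives percentage-style values.

```python
# Illustrative only: toy AUC / EER computation from match scores.
# Assumes label 1 = audio matches the keyword, 0 = it does not,
# and that a higher score means a more confident match.

def _split(scores, labels):
    """Separate scores of matching (positive) and non-matching (negative) pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg

def auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive
    outscores a random negative (ties count as one half)."""
    pos, neg = _split(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def eer(scores, labels):
    """Equal error rate: the error at the threshold where the false-accept
    rate and false-reject rate are closest (averaged if they differ)."""
    pos, neg = _split(scores, labels)
    best_gap, best_rate = float("inf"), None
    for t in sorted(set(scores)):
        far = sum(n >= t for n in neg) / len(neg)  # negatives accepted
        frr = sum(p < t for p in pos) / len(pos)   # positives rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate

if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
    print(auc(toy_scores, toy_labels))  # 0.8125
    print(eer(toy_scores, toy_labels))  # 0.25
```

A higher AUC and a lower EER both indicate better separation between matching and non-matching audio-text pairs, which is the direction of the improvement reported above.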
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Open-vocabulary keyword spotting | Google Speech Commands (GSC) | EER 8 | 6 |
| Open-vocabulary keyword spotting | LibriPhrase easy | EER 0.0214 | 6 |
| Open-vocabulary keyword spotting | LibriPhrase hard | EER 14.3 | 6 |
| Open-vocabulary keyword spotting | POB-Spark | EER 29.23 | 6 |
| Open-vocabulary keyword spotting | POB-LP | Accuracy 98.7 | 6 |