DONUT: CTC-based Query-by-Example Keyword Spotting
About
Keyword spotting, or wakeword detection, is an essential feature for hands-free operation of modern voice-controlled devices. With such devices becoming ubiquitous, users might want to choose a personalized custom wakeword. In this work, we present DONUT, a CTC-based algorithm for online query-by-example keyword spotting that enables custom wakeword detection. The algorithm works by recording a small number of training examples from the user, generating a set of label sequence hypotheses from these training examples, and detecting the wakeword by aggregating the scores of all the hypotheses given a new audio recording. Our method combines the generalization and interpretability of CTC-based keyword spotting with the user-adaptation and convenience of a conventional query-by-example system. DONUT has low computational requirements and is well-suited for both learning and inference on embedded systems without requiring private user data to be uploaded to the cloud.
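The detection step described above (score every label sequence hypothesis against a new recording, then aggregate) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that an acoustic model has already produced per-frame log posteriors `log_probs` over labels, with index 0 as the CTC blank, and that the hypotheses were obtained beforehand by decoding the user's enrollment recordings. The function names, the uniform default weights, and the weighted log-sum-exp aggregation rule are all hypothetical choices made for illustration.

```python
import numpy as np

def ctc_log_likelihood(log_probs, labels, blank=0):
    """Return log P(labels | audio) under CTC using the forward algorithm.

    log_probs: (T, V) array of per-frame log posteriors.
    labels: non-empty sequence of label ids, without blanks.
    """
    if len(labels) == 0:
        raise ValueError("labels must be non-empty")
    T = log_probs.shape[0]
    # Extended sequence with blanks interleaved: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)
    alpha = np.full(S, -np.inf)
    # A valid alignment starts with either the leading blank or the first label.
    alpha[0] = log_probs[0, ext[0]]
    alpha[1] = log_probs[0, ext[1]]
    for t in range(1, T):
        prev = alpha
        alpha = np.full(S, -np.inf)
        for s in range(S):
            terms = [prev[s]]                 # stay on the same symbol
            if s >= 1:
                terms.append(prev[s - 1])     # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(prev[s - 2])
            alpha[s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
    # A valid alignment ends on the last label or on the trailing blank.
    return np.logaddexp(alpha[-1], alpha[-2])

def wakeword_score(log_probs, hypotheses, weights=None):
    """Aggregate per-hypothesis CTC scores (assumed rule: weighted log-sum-exp)."""
    scores = np.array([ctc_log_likelihood(log_probs, h) for h in hypotheses])
    if weights is None:
        # Assumption: every hypothesis contributes equally.
        weights = np.full(len(hypotheses), 1.0 / len(hypotheses))
    return np.logaddexp.reduce(scores + np.log(weights))
```

In use, `wakeword_score` would be compared against a tuned threshold to decide whether the wakeword was spoken. The quantities involved are a small posterior matrix and a handful of short label sequences, which is in keeping with the low computational requirements the abstract describes.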
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Keyword Spotting | LibriPhrase Easy (LPE) | EER | 28.74 | 46 |
| Speaker-Independent Keyword Spotting | LibriPhrase Hard | AUROC | 62.55 | 21 |
| Speaker-Independent Keyword Spotting | Google Speech Commands (GSC) | AUROC | 92.09 | 12 |
| Speaker-Independent Keyword Spotting | Qualcomm Keyword Speech (Qcomm) | AUROC | 50.13 | 10 |