No Word Left Behind: Mitigating Prefix Bias in Open-Vocabulary Keyword Spotting

About

Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recently, researchers have explored audio-text joint embeddings, which let users enroll phrases with text, and proposed techniques to disambiguate similar utterances. We find that existing OV-KWS solutions often over-weight the beginning phonemes of an enrollment, causing false triggers when negative enrollment-query pairs share a prefix ("turn the volume up" vs. "turn the volume down"). We trace this to two factors: training data bias and position-biased cross-modal scoring. To address these limitations, we introduce the Partial Overlap Benchmark (POB) with two datasets, POB-Spark and POB-LibriPhrase (POB-LP), containing mismatched audio-text pairs with shared prefixes, and propose Equal-weighting Position Scoring (EPS), a lightweight decision layer. Using EPS alone reduces EER on POB-Spark from 64.4% to 29.3% and improves POB-LP accuracy from 87.6% to 96.8%, while maintaining performance on LibriPhrase and Google Speech Commands (GSC). With POB data added in training, our method achieves the best POB benchmark results while incurring the least degradation on prior metrics among baselines. This degradation is most pronounced on GSC, which contains only one-word commands. We leave mitigating this trade-off to future work.

Yi Liu, Chuan-Che Huang, Xiao Quan • 2026

Related benchmarks

Task	Dataset	Result
Open-vocabulary keyword spotting	LibriPhrase easy	EER 0.0182	11
Open-vocabulary keyword spotting	LibriPhrase hard	EER 13.7	6
Open-vocabulary keyword spotting	POB-Spark	EER 16.15	6
Open-vocabulary keyword spotting	Google Speech Commands (GSC)	EER 8.87	6
Open-vocabulary keyword spotting	POB-LP	Accuracy 99.42	6
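
Most rows above report equal error rate (EER), the operating point at which the false-accept rate on mismatched pairs equals the false-reject rate on matched pairs. The sketch below shows one common way to estimate it from a list of similarity scores. It is an illustration only and not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means a more likely match, and the toy scores are all assumptions made here.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate EER from similarity scores (assumed: higher = more likely a match).

    labels are truthy for matched enrollment-query pairs, falsy for mismatched ones.
    Returns (eer, threshold) at the candidate threshold where FAR and FRR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # matched pairs
    neg = scores[~labels]   # mismatched pairs, e.g. shared-prefix negatives

    # Try every observed score as a threshold; a pair is accepted when score >= threshold.
    thresholds = np.unique(scores)
    far = np.array([(neg >= t).mean() for t in thresholds])  # false-accept rate
    frr = np.array([(pos < t).mean() for t in thresholds])   # false-reject rate

    # Take the threshold where the two error rates are closest and average them.
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0, thresholds[i]


# Toy data (hypothetical): four matched pairs and four mismatched pairs.
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
```

With a discrete set of scores the two error curves rarely cross exactly, so this sketch averages them at the nearest point; evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different numbers.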

