Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting

About

For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. To facilitate this, we optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. However, the inherent heterogeneity between audio and text modalities presents a significant challenge. To address this, we propose Modality Adversarial Learning (MAL), which reduces the domain gap in heterogeneous modality representations. Specifically, we train a modality classifier adversarially to encourage both encoders to generate modality-invariant embeddings. Additionally, we apply DML to achieve phoneme-level alignment between audio and text, and conduct extensive comparisons across various DML objectives. Experiments on the Wall Street Journal (WSJ) and LibriPhrase datasets demonstrate the effectiveness of the proposed approach.
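The interplay between the DML objective and the adversarial modality classifier can be illustrated with a small sketch. This is not the authors' implementation: the contrastive loss, the linear classifier (`w`, `b`), the weight `lam`, and all function names are assumptions chosen for illustration, and the abstract does not specify the exact losses used. The classifier is trained to minimize its cross-entropy, while the encoders are scored on the DML loss minus a weighted copy of that cross-entropy, so they are rewarded when audio and text embeddings become hard to tell apart.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_dml_loss(audio, text, margin=0.5):
    """Illustrative DML loss: pull matched (audio[i], text[i]) pairs together
    and push mismatched pairs below a similarity margin (assumed form)."""
    total, count = 0.0, 0
    for i, a in enumerate(audio):
        for j, t in enumerate(text):
            sim = cosine(a, t)
            total += (1.0 - sim) if i == j else max(0.0, sim - margin)
            count += 1
    return total / count

def modality_bce(embeddings, labels, w, b):
    """Binary cross-entropy of a hypothetical linear modality classifier
    (label 0 = audio, label 1 = text)."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-logit))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(embeddings)

def encoder_objective(audio, text, w, b, lam=0.1):
    """Encoders minimize the DML loss while maximizing the classifier's loss.
    The classifier itself would separately minimize `adv`."""
    dml = contrastive_dml_loss(audio, text)
    adv = modality_bce(audio + text, [0] * len(audio) + [1] * len(text), w, b)
    return dml - lam * adv, dml, adv

# Toy phoneme-level embeddings that are already perfectly aligned.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[1.0, 0.0], [0.0, 1.0]]
total, dml, adv = encoder_objective(audio_emb, text_emb, w=[0.0, 0.0], b=0.0)
```

In this toy case the two modalities are indistinguishable, so an uninformative classifier sits at chance and its cross-entropy equals ln 2. In a gradient-based framework the same sign flip is commonly implemented with a gradient reversal layer; whether the paper does so is not stated in the abstract.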

Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho • 2025

Related benchmarks

Task	Dataset	Result
Keyword Spotting	LibriPhrase Easy (LPE)	EER 1.33	51
Speaker-Independent Keyword Spotting	LibriPhrase hard	AUROC 88.71	21
Zero-shot Keyword Spotting	LibriPhrase Easy (LPE) Low phonetic confusion other-500 (train)	AUC 99.86	9
Zero-shot Keyword Spotting	LibriPhrase Hard High phonetic confusion (train-other-500)	AUC 88.71	9
Text-enrolled Keyword Spotting	LibriPhrase hard	EER 21.04	5
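
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute the equal error rate (EER) and the area under the ROC curve (AUC) from detection scores. It is a generic illustration, not the evaluation code behind the reported numbers. The function names and the toy scores are hypothetical, and the results here are fractions in [0, 1].

```python
def error_rates(scores, labels):
    """Return (false-accept rate, false-reject rate) at each candidate threshold.
    A trial is accepted when its score is >= the threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    thresholds = sorted(set(scores)) + [float("inf")]
    rates = []
    for th in thresholds:
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= th)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < th)
        rates.append((false_accepts / n_neg, false_rejects / n_pos))
    return rates

def equal_error_rate(scores, labels):
    """Approximate the EER as the mean of FAR and FRR at the threshold
    where the two rates are closest."""
    far, frr = min(error_rates(scores, labels), key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2.0

def auc(scores, labels):
    """AUC as the probability that a random positive trial outscores
    a random negative trial (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy similarity scores: label 1 = keyword matches the enrolled text, 0 = mismatch.
toy_scores = [0.9, 0.6, 0.7, 0.2]
toy_labels = [1, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
toy_auc = auc(toy_scores, toy_labels)
```

Lower EER and higher AUC both indicate better separation between matching and non-matching audio-text pairs.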
