Synaspot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy

About

Open-vocabulary keyword spotting (KWS) in continuous speech streams holds significant practical value across a wide range of real-world applications. Increasing attention has been paid to the role of different modalities in KWS, and their effectiveness has been acknowledged. However, the increased parameter cost of multimodal integration and the constraints of end-to-end deployment have limited the practical applicability of such models. To address these challenges, we propose a lightweight, streaming multi-modal framework. First, we focus on multimodal enrollment features and reduce speaker-specific (voiceprint) information in the speech enrollment to extract speaker-independent characteristics. Second, we effectively fuse speech and text features. Finally, we introduce a streaming decoding framework that only requires the encoder to extract features, which are then mathematically decoded with our three modal representations. Experiments on LibriPhrase and WenetPhrase demonstrate the effectiveness of our model. Compared to existing streaming approaches, our method achieves better performance with significantly fewer parameters.

Kewei Li, Yinan Zhong, Xiaotao Liang, Tianchi Dai, Shaofei Xue • 2025

Related benchmarks

Task	Dataset	Result
Keyword Spotting	LibriPhrase Easy (LPE)	EER 5.77	51
Keyword Spotting	LibriPhrase Hard (LPH)	EER 0.2729	25
Keyword Spotting	WenetPhrase Mandarin (Easy)	EER 0.1456	6
Keyword Spotting	WenetPhrase Mandarin (Hard)	EER 34.5	6
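
The results above are reported as equal error rate (EER): the operating point at which the false acceptance rate equals the false rejection rate. As a reading aid, the following is a minimal sketch of how EER is commonly computed from match scores. It is not the authors' evaluation code; the function name `equal_error_rate` and the toy scores are hypothetical, and the sketch assumes that higher scores mean "keyword present" and that both positive and negative trials exist.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return the EER as a fraction in [0, 1].

    scores: similarity scores, higher = more likely a keyword match (assumption).
    labels: 1/True for genuine keyword trials, 0/False for non-keyword trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]

    # Candidate thresholds: every distinct score, plus +inf so the
    # "reject everything" operating point is also considered.
    thresholds = np.append(np.unique(scores), np.inf)

    # False rejection rate: genuine trials scoring below the threshold.
    frr = np.array([(pos < t).mean() for t in thresholds])
    # False acceptance rate: non-keyword trials scoring at or above it.
    far = np.array([(neg >= t).mean() for t in thresholds])

    # With finitely many trials the two curves may not cross exactly,
    # so take the threshold where they are closest and average them.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Toy example with hypothetical scores (not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

The function returns a fraction; multiply by 100 to express it as a percentage. Averaging FAR and FRR at the closest threshold is one common convention, and interpolating between neighbouring thresholds is an equally valid alternative that can give slightly different values on small trial sets.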

