Streaming keyword spotting on mobile devices

About

In this work we explore the latency and accuracy of keyword spotting (KWS) models in streaming and non-streaming modes on mobile phones. Converting a neural network (NN) model from non-streaming mode (the model receives the whole input sequence and then returns the classification result) to streaming mode (the model receives a portion of the input sequence and classifies it incrementally) may require manual model rewriting. We address this by designing a TensorFlow/Keras-based library that automatically converts non-streaming models to streaming ones with minimal effort. With this library we benchmark multiple KWS models in both streaming and non-streaming modes on mobile phones and demonstrate different tradeoffs between latency and accuracy. We also explore novel KWS models with multi-head attention, which reduce the classification error by 10% compared with the state of the art on the Google Speech Commands V2 dataset. The streaming library, with all experiments, is open-sourced.
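The difference between the two modes can be illustrated with a small sketch. It is not the library's API: the function `conv1d_non_streaming`, the class `StreamingConv1D`, and the toy kernel are hypothetical names chosen for illustration, and plain NumPy stands in for TensorFlow/Keras. The sketch assumes a single causal 1D convolution and shows that processing one sample at a time with an internal state buffer can reproduce the result of processing the whole sequence at once.

```python
import numpy as np


def conv1d_non_streaming(x, kernel):
    """Causal 1D convolution computed over the whole sequence at once."""
    k = len(kernel)
    # Left-pad with zeros so output[t] depends only on x[t-k+1 .. t].
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])


class StreamingConv1D:
    """Same convolution, fed one sample at a time (hypothetical helper)."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # The state holds the last k samples; zeros match the padding above.
        self.state = np.zeros(len(self.kernel))

    def step(self, sample):
        # Shift the buffer left by one and append the newest sample.
        self.state = np.roll(self.state, -1)
        self.state[-1] = sample
        return float(self.state @ self.kernel)


rng = np.random.default_rng(0)
signal = rng.standard_normal(16)       # toy input sequence
kernel = np.array([0.25, 0.5, 0.25])   # toy filter weights

full_output = conv1d_non_streaming(signal, kernel)
layer = StreamingConv1D(kernel)
stream_output = np.array([layer.step(s) for s in signal])

# Both modes should agree on every output sample.
print(np.allclose(full_output, stream_output))
```

The streaming version keeps only the last few samples in memory and emits an output as soon as each new sample arrives, which is the property that matters for on-device latency. A real model would need this kind of state for every layer that looks across time.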

Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirko Visontai, Stella Laurenzo • 2020

Related benchmarks

Task	Dataset	Result
Keyword Spotting	Google Speech Commands (test)	Accuracy 96.6	71
Keyword Spotting	Google Speech Commands v1 (test)	Accuracy 97.2	68
Keyword Spotting	Google Speech Commands V2-35	Accuracy 97.27	42
Keyword Spotting	Google Speech Commands V2 (test)	Accuracy 98	41
Keyword Spotting	Google Speech Commands V2-12 2018	Accuracy 98	16
Keyword Spotting	Google Speech Commands 12 V2 (Official)	Accuracy 98.04	8
Keyword Spotting	Far-field Command (test)	Accuracy (Clean) 78.83	8
