Keyword Transformer: A Self-Attention Model for Keyword Spotting
About
The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12 and 35-command tasks respectively.
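To make the "fully self-attentional" idea concrete, here is a minimal NumPy sketch of a KWT-style forward pass: spectrogram frames are linearly projected into tokens, a class token is prepended, one self-attention layer mixes the tokens, and the class token is read out for classification. This is an illustrative toy with random weights and hypothetical dimensions, not the authors' implementation (which stacks multiple Transformer blocks with MLPs and layer normalization).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy MFCC "spectrogram": 98 time frames x 40 mel bins (illustrative shape
# for ~1 s of audio, as in Speech Commands preprocessing).
spec = rng.standard_normal((98, 40))

d_model = 64  # hypothetical embedding width
# Tokenization: each time frame is linearly projected to a d_model token.
W_embed = rng.standard_normal((40, d_model)) * 0.02
tokens = spec @ W_embed                       # (98, d_model)

# Prepend a class token and add a (stand-in) positional embedding.
cls = rng.standard_normal((1, d_model)) * 0.02
x = np.concatenate([cls, tokens], axis=0)     # (99, d_model)
x = x + rng.standard_normal(x.shape) * 0.02

# One single-head self-attention layer: the building block a KWT stacks.
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(d_model))    # (99, 99) attention weights
x = x + attn @ v                              # residual connection

# Classify from the class token; 12 classes for the 12-command task.
W_head = rng.standard_normal((d_model, 12)) * 0.02
probs = softmax(x[0] @ W_head)
print(probs.shape)
```

In the real model this attention layer is repeated with multiple heads, and the whole network is trained end to end on labeled keyword audio; the sketch only shows how the architecture needs no convolutional or recurrent encoder in front of it.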
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Keyword Spotting | Google Speech Commands V2-35 | Accuracy | 97.74 | 42 |
| Keyword Spotting | Google Speech Commands V2-12 2018 | Accuracy | 98.56 | 16 |
| Keyword Spotting | Google Speech Commands 12 V2 (Official) | Accuracy | 98.54 | 8 |
| Keyword Spotting | Far-field Command (test) | Accuracy (Clean) | 93.47 | 8 |