Multi-Head State Space Model for Speech Recognition
About
State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets without using an external language model.
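To make the idea of parallel gated SSM heads concrete, below is a minimal NumPy sketch of a multi-head SSM layer that keeps the input/output shape of multi-head attention. Every name, shape, the sigmoid gating, and the diagonal-SSM parameterization are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Minimal sketch of a gated multi-head SSM layer, assuming a diagonal
# linear SSM per head. All parameter names and shapes are hypothetical.
import numpy as np

def ssm_scan(u, a, b, c):
    """Diagonal linear SSM over time: x_k = a*x_{k-1} + b*u_k, y_k = c*x_k.
    u: (T, d) input sequence; a, b, c: (d, n) per-channel state parameters."""
    T, d = u.shape
    x = np.zeros_like(a)
    y = np.empty((T, d))
    for k in range(T):
        x = a * x + b * u[k][:, None]   # elementwise state update
        y[k] = (c * x).sum(axis=1)      # read out each channel's state
    return y

def mh_ssm_layer(u, params):
    """Project input into H heads, run an independent SSM per head,
    gate each head's output, and mix heads with an output projection."""
    heads_out = []
    for h, (a, b, c, Wg) in enumerate(params["heads"]):
        uh = u @ params["W_in"][h]                 # per-head input projection
        yh = ssm_scan(uh, a, b, c)                 # head-local temporal dynamics
        gate = 1.0 / (1.0 + np.exp(-(uh @ Wg)))    # sigmoid gate from the input
        heads_out.append(gate * yh)                # gated head output
    y = np.concatenate(heads_out, axis=-1)
    return y @ params["W_out"]                     # mix heads back to model dim

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
T, d_model, H, d_head, n_state = 16, 8, 2, 4, 3
params = {
    "W_in": [rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(H)],
    "heads": [
        (
            np.exp(-rng.uniform(0.05, 1.0, size=(d_head, n_state))),  # |a| < 1 keeps the recurrence stable
            rng.normal(size=(d_head, n_state)) * 0.1,
            rng.normal(size=(d_head, n_state)) * 0.1,
            rng.normal(size=(d_head, d_head)) * 0.1,
        )
        for _ in range(H)
    ],
    "W_out": rng.normal(size=(H * d_head, d_model)) * 0.1,
}
u = rng.normal(size=(T, d_model))
print(mh_ssm_layer(u, params).shape)  # (16, 8): same shape as the input
```

Because the layer maps a `(T, d_model)` sequence to a `(T, d_model)` sequence, it can be swapped in wherever a multi-head attention block sits in a transformer encoder, which is how the paper uses it.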
Related benchmarks
| Task | Dataset | Result (WER) | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | 4.36% | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | 1.91% | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | 4.37% | 411 |
| Speech Recognition | LibriSpeech clean (dev) | 1.76% | 59 |