Multi-Head State Space Model for Speech Recognition
About
State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets without using an external language model.
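To make the idea of parallel gated SSM heads concrete, below is a minimal NumPy sketch of a multi-head SSM layer that keeps the input/output shape of multi-head attention. Every name, shape, the sigmoid gating, and the diagonal-SSM parameterization are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Minimal sketch of a gated multi-head SSM layer, assuming a diagonal
# linear SSM per head. All parameter names and shapes are hypothetical.
import numpy as np

def ssm_scan(u, a, b, c):
    """Diagonal linear SSM over time: x_k = a*x_{k-1} + b*u_k, y_k = c*x_k.
    u: (T, d) input sequence; a, b, c: (d, n) per-channel state parameters."""
    T, d = u.shape
    x = np.zeros_like(a)
    y = np.empty((T, d))
    for k in range(T):
        x = a * x + b * u[k][:, None]   # elementwise state update
        y[k] = (c * x).sum(axis=1)      # read out each channel's state
    return y

def mh_ssm_layer(u, params):
    """Project input into H heads, run an independent SSM per head,
    gate each head's output, and mix heads with an output projection."""
    heads_out = []
    for h, (a, b, c, Wg) in enumerate(params["heads"]):
        uh = u @ params["W_in"][h]                 # per-head input projection
        yh = ssm_scan(uh, a, b, c)                 # head-local temporal dynamics
        gate = 1.0 / (1.0 + np.exp(-(uh @ Wg)))    # sigmoid gate from the input
        heads_out.append(gate * yh)                # gated head output
    y = np.concatenate(heads_out, axis=-1)
    return y @ params["W_out"]                     # mix heads back to model dim

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
T, d_model, H, d_head, n_state = 16, 8, 2, 4, 3
params = {
    "W_in": [rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(H)],
    "heads": [
        (
            np.exp(-rng.uniform(0.05, 1.0, size=(d_head, n_state))),  # |a| < 1 keeps the recurrence stable
            rng.normal(size=(d_head, n_state)) * 0.1,
            rng.normal(size=(d_head, n_state)) * 0.1,
            rng.normal(size=(d_head, d_head)) * 0.1,
        )
        for _ in range(H)
    ],
    "W_out": rng.normal(size=(H * d_head, d_model)) * 0.1,
}
u = rng.normal(size=(T, d_model))
print(mh_ssm_layer(u, params).shape)  # (16, 8): same shape as the input
```

Because the layer maps a `(T, d_model)` sequence to a `(T, d_model)` sequence, it can be swapped in wherever a multi-head attention block sits in a transformer encoder, which is how the paper uses it.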
Related benchmarks
| Task | Dataset | Result (WER) | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | 4.36% | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | 1.91% | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | 4.37% | 411 |
| Speech Recognition | LibriSpeech clean (dev) | 1.76% | 59 |