Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models

About

Current automatic speech recognition systems struggle to model long speech sequences because of the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks. However, their use in speech technology tasks remains under-explored. We propose Speech-Mamba, which incorporates selective state space modeling into Transformer neural architectures. In Speech-Mamba, long-sequence representations from selective state space models are complemented by lower-level representations from Transformer-based modeling. Speech-Mamba achieves a better capacity to model long-range dependencies, as it scales near-linearly with sequence length.
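To make the near-linear scaling concrete, here is a minimal sketch of a selective state space recurrence for a single channel. It is not the Speech-Mamba implementation: the function name `selective_scan`, the diagonal state matrix `A`, and the simplified discretization are assumptions chosen for illustration, and the per-step `delta`, `B`, and `C` are passed in directly rather than computed from the input as a real selective model would do. The point is only that each time step costs a fixed amount of work, so the total cost grows linearly with sequence length instead of quadratically as in self-attention.

```python
import math


def selective_scan(x, delta, A, B, C):
    """Illustrative single-channel selective SSM recurrence (assumed form).

    x:     list of T input samples
    delta: list of T positive step sizes (input-dependent in a real model)
    A:     list of N diagonal state entries (typically negative)
    B, C:  lists of T lists of N per-step projections (input-dependent
           in a real model)
    Returns a list of T outputs.
    """
    n = len(A)
    h = [0.0] * n  # hidden state carried across time steps
    y = []
    for t in range(len(x)):  # one pass over the sequence: O(T * N) work
        for i in range(n):
            a_bar = math.exp(delta[t] * A[i])  # discretized state decay
            h[i] = a_bar * h[i] + delta[t] * B[t][i] * x[t]
        y.append(sum(C[t][i] * h[i] for i in range(n)))
    return y


# Toy usage: an impulse at t=0 decays through the state at t=1.
outputs = selective_scan(
    x=[1.0, 0.0],
    delta=[1.0, 1.0],
    A=[-1.0],
    B=[[1.0], [1.0]],
    C=[[1.0], [1.0]],
)
```

Because the state `h` has a fixed size, memory use does not grow with the number of frames already processed, which is what makes this family of models attractive for long utterances.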

Xiaoxue Gao, Nancy F. Chen • 2024

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech (test-other)	WER 5.23	1447
Automatic Speech Recognition	LibriSpeech clean (test)	WER 2.32	1410
Automatic Speech Recognition	LibriSpeech (dev-other)	WER 5.13	535
Speech Recognition	LibriSpeech clean (dev)	WER 0.0216	125
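The results above are word error rates (WER). As a reminder of what the metric measures, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; benchmark toolkits usually apply text normalization first and often report the value as a percentage.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Multiplying the returned fraction by 100 gives the percentage form commonly used in result tables.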
