SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

About

Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.
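The mechanism the abstract describes can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the abstract only states that a mean over all time steps is combined with time-specific information, so the projection matrices (`w_local`, `w_summary`, `w_combine`), the `tanh` non-linearity, and the concatenation used to combine the two parts are all assumptions made for this example.

```python
import numpy as np


def summary_mixing(x, w_local, w_summary, w_combine):
    """Hypothetical SummaryMixing-style layer.

    x:         (T, d) sequence of T feature vectors
    w_local:   (d, h) projection for the time-specific branch (assumed)
    w_summary: (d, h) projection for the summary branch (assumed)
    w_combine: (2h, d_out) projection applied to the combined features (assumed)
    """
    # Time-specific information: each step is transformed independently.
    local = np.tanh(x @ w_local)                      # (T, h)

    # Single summary for the whole utterance: mean over all time steps.
    summary = np.tanh(x @ w_summary).mean(axis=0)     # (h,)

    # Share the same summary vector with every time step.
    summary_tiled = np.broadcast_to(summary, local.shape)  # (T, h)

    # Combine the summary with the time-specific features.
    combined = np.concatenate([local, summary_tiled], axis=1)  # (T, 2h)
    return combined @ w_combine                       # (T, d_out)


rng = np.random.default_rng(0)
T, d, h, d_out = 7, 4, 5, 3
x = rng.normal(size=(T, d))
w_local = rng.normal(size=(d, h))
w_summary = rng.normal(size=(d, h))
w_combine = rng.normal(size=(2 * h, d_out))

y = summary_mixing(x, w_local, w_summary, w_combine)
print(y.shape)
```

Every operation here touches each time step a constant number of times, so the cost grows linearly with the utterance length T, whereas self-attention builds a T × T matrix of pairwise scores.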

Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya • 2023

Related benchmarks

Task	Dataset	Result	Rank
Automatic Speech Recognition	LibriSpeech clean (test)	WER 4.85	1207
Speech Recognition	LibriSpeech other (test)	WER 12.97	105

