
SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

About

Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.
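The core idea described above, replacing pairwise attention with a single mean summary that is broadcast back to every time step, can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's exact architecture: the weight names, layer sizes, and nonlinearities here are assumptions.

```python
import numpy as np

def summary_mixing(x, w_local, w_summary, w_out):
    """Toy SummaryMixing block. x has shape (T, d): T time steps, d features.
    The local branch transforms each frame independently; the summary branch
    averages transformed frames into one vector shared by all time steps,
    so the cost is linear in T rather than quadratic."""
    local = np.tanh(x @ w_local)                    # time-specific features, (T, d)
    summary = np.tanh(x @ w_summary).mean(axis=0)   # single mean summary, (d,)
    # Combine per-step information with the shared summary vector.
    combined = np.concatenate(
        [local, np.broadcast_to(summary, local.shape)], axis=1)  # (T, 2d)
    return combined @ w_out                         # back to (T, d)

rng = np.random.default_rng(0)
T, d = 50, 8
x = rng.standard_normal((T, d))
out = summary_mixing(x,
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)),
                     rng.standard_normal((2 * d, d)))
print(out.shape)  # (50, 8)
```

Because every time step sees the same summary vector, no T-by-T interaction matrix is ever formed, which is where the linear-time and memory savings claimed in the abstract come from.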

Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya• 2023

Related benchmarks

Task                          Dataset                     Result       Rank
Automatic Speech Recognition  LibriSpeech clean (test)    WER 4.85     1156
Speech Recognition            LibriSpeech other (test)    WER 12.97    105
