Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

About

The recently proposed Conformer model has become the de facto backbone for various downstream speech tasks, owing to its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro- and micro-architecture of Conformer, we propose Squeezeformer, which consistently outperforms state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure, which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure in which a multi-head attention or convolution module is followed by a feed-forward module, instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to sub-sample the input signal. Squeezeformer achieves state-of-the-art word-error-rates (WER) of 7.5%, 6.5%, and 6.0% on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.

Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer • 2022
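
The two architectural ideas in the abstract that are easiest to show concretely are the depthwise down-sampling layer and the Temporal U-Net. Below is a minimal PyTorch sketch of just those two ideas, not the authors' released implementation: the class names `DepthwiseDownsample` and `TemporalUNet` are hypothetical, and a stock `nn.TransformerEncoderLayer` stands in for the actual Squeezeformer block. See the open-sourced code for the real model.

```python
# Minimal sketch of two Squeezeformer ideas (illustrative, not the
# authors' code): a depthwise-separable temporal down-sampler and the
# Temporal U-Net that halves the frame rate in the middle of the network.
import torch
import torch.nn as nn


class DepthwiseDownsample(nn.Module):
    """Halve the time axis with a depthwise conv plus a pointwise conv.

    A depthwise-separable stride-2 convolution is far cheaper than a
    dense strided convolution of the same shape.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=3, stride=2,
                                   padding=1, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> (batch, ceil(time / 2), dim)
        x = x.transpose(1, 2)                 # convolve over the time axis
        x = self.pointwise(self.depthwise(x))
        return x.transpose(1, 2)


class TemporalUNet(nn.Module):
    """Full-rate blocks, down-sample, more blocks, up-sample, skip-add.

    Self-attention cost is quadratic in sequence length, so halving the
    rate for the middle of the network roughly quarters its cost there.
    """

    def __init__(self, dim: int, n_early: int = 2, n_late: int = 2):
        super().__init__()

        def block():  # stand-in for a Squeezeformer block
            return nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                              batch_first=True)

        self.early = nn.ModuleList(block() for _ in range(n_early))
        self.down = DepthwiseDownsample(dim)
        self.late = nn.ModuleList(block() for _ in range(n_late))
        self.up = nn.Upsample(scale_factor=2)  # nearest-neighbor in time

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.early:
            x = blk(x)
        skip = x                               # full-rate features
        y = self.down(x)
        for blk in self.late:
            y = blk(y)
        y = self.up(y.transpose(1, 2)).transpose(1, 2)  # back to full rate
        return y[:, :skip.size(1)] + skip      # U-Net-style skip connection


x = torch.randn(2, 100, 64)                    # (batch, frames, features)
print(TemporalUNet(dim=64)(x).shape)           # torch.Size([2, 100, 64])
```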

Related benchmarks

Task                         | Dataset                   | WER (%) | Rank
---------------------------- | ------------------------- | ------- | ----
Automatic Speech Recognition | LibriSpeech clean (test)  | 2.47    | 1156
Automatic Speech Recognition | LibriSpeech (test-other)  | 5.97    | 1151
Automatic Speech Recognition | LibriSpeech (dev-other)   | 5.77    | 462
Automatic Speech Recognition | LibriSpeech (dev-clean)   | 2.27    | 340
Automatic Speech Recognition | LibriSpeech (test-clean)  | 2.47    | 84
Automatic Speech Recognition | TED-LIUM V3               | 23.5    | 77
Automatic Speech Recognition | Earnings-22               | 53.44   | 29
Long-form Transcription      | Earnings-21               | 38.09   | 26
Automatic Speech Recognition | Telephony                 | 11.72   | 7
Automatic Speech Recognition | Reading                   | 5.2     | 7

Showing 10 of 12 rows.

Other info

Code (open-sourced, available online)
