
Accelerating Transducers through Adjacent Token Merging

About

Recent end-to-end automatic speech recognition (ASR) systems often use a Transformer-based acoustic encoder that generates embeddings at a high frame rate. This design is inefficient, particularly for long speech signals, because the computational cost of self-attention grows quadratically with sequence length. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens whose attention keys have high similarity scores. In this way, the total number of time steps is reduced, and inference of both the encoder and the joint network is accelerated. Experiments on LibriSpeech show that our method reduces the number of tokens by 57% and improves GPU inference speed by 70% without any notable loss of accuracy. Additionally, we demonstrate that A-ToMe is also an effective solution for reducing tokens in long-form ASR, where the input speech consists of multiple utterances.
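The core idea above can be illustrated with a minimal sketch: compute the cosine similarity between the attention keys of each pair of adjacent tokens, and greedily average pairs whose similarity exceeds a threshold. This is a hypothetical illustration, not the authors' implementation; the function name, the averaging rule, and the `threshold` parameter are assumptions made for clarity.

```python
import numpy as np

def merge_adjacent_tokens(tokens, keys, threshold=0.9):
    """Greedily merge adjacent token pairs whose keys are highly similar.

    tokens: (T, D) encoder embeddings; keys: (T, Dk) attention keys.
    Returns a merged sequence of shape (T', D) with T' <= T.
    Illustrative sketch only, not the A-ToMe paper's exact algorithm.
    """
    # Cosine similarity between each token's key and its right neighbor's key.
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = np.sum(k[:-1] * k[1:], axis=1)  # shape (T-1,)

    merged = []
    i = 0
    T = tokens.shape[0]
    while i < T:
        if i + 1 < T and sim[i] > threshold:
            # Similar neighbors: replace the pair with their average.
            merged.append((tokens[i] + tokens[i + 1]) / 2)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return np.stack(merged)
```

Applied layer by layer, such a scheme shortens the sequence progressively, which is what reduces both the quadratic self-attention cost in the encoder and the per-frame work in the joint network.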

Yuang Li, Yu Wu, Jinyu Li, Shujie Liu• 2023

Related benchmarks

Task                          Dataset                     Metric  Result  Rank
Automatic Speech Recognition  LibriSpeech clean (test)    WER     3.92    1156
Automatic Speech Recognition  LibriSpeech (test-other)    WER     5.87    1151
Automatic Speech Recognition  LibriSpeech (dev-other)     WER     6.04    462
Speech Recognition            LibriSpeech clean (dev)     WER     0.0412  80
Automatic Speech Recognition  WenetSpeech Meeting (test)  CER     12.97   78
Automatic Speech Recognition  WenetSpeech Net (test)      CER     12.08   57
Automatic Speech Recognition  AISHELL-1                   CER     3.58    50
Automatic Speech Recognition  Fleurs En                   WER     5.77    34
Automatic Speech Recognition  AISHELL-2                   CER     5.56    29
Automatic Speech Recognition  Fleurs zh                   CER     5.08    26
