
Segmental Attention Decoding With Long Form Acoustic Encodings

About

We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model then loses the ability to order acoustic encodings, because cross-attention is invariant to permutations of its keys and values. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover the diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show that these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
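The permutation-invariance argument above can be checked numerically with a toy example. The sketch below is not the authors' model: it assumes a single attention head with no learned projections, random vectors standing in for acoustic encodings, and sinusoidal encodings as one possible choice of explicit absolute positional encoding (the abstract does not say which form is used). All function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one decoder query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def sinusoidal_position(pos, dim):
    """Sinusoidal absolute positional encoding (an assumed choice, see lead-in)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def add_positions(frames):
    """Add the encoding of each frame's index within the segment to that frame."""
    dim = len(frames[0])
    return [
        [x + p for x, p in zip(frame, sinusoidal_position(t, dim))]
        for t, frame in enumerate(frames)
    ]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
dim, num_frames = 8, 6
frames = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
query = [rng.gauss(0, 1) for _ in range(dim)]
reordered = frames[::-1]  # same frames, reversed time order

# Without positions: key/value pairs are an unordered set, so order is invisible.
diff_plain = max_abs_diff(
    cross_attention(query, frames, frames),
    cross_attention(query, reordered, reordered),
)

# With explicit positions added per segment: the two orderings become distinguishable.
pos_a, pos_b = add_positions(frames), add_positions(reordered)
diff_positional = max_abs_diff(
    cross_attention(query, pos_a, pos_a),
    cross_attention(query, pos_b, pos_b),
)

print(f"no positions:   {diff_plain:.2e}")
print(f"with positions: {diff_positional:.2e}")
```

Without positions, the two outputs should agree up to floating-point rounding. With positions they should differ, which is the ordering signal that modification (1) is meant to restore.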

Pawel Swietojanski, Xinwei Li, Mingbin Xu, Takaaki Hori, Dogan Can, Xiaodan Zhuang • 2025

Related benchmarks

Task                          Dataset                              Result      Rank
Automatic Speech Recognition  TED-LIUM3 (test)                     WER 0.039   55
Automatic Speech Recognition  LibriSpeech clean segmented (test)   WER 1.7     10
Automatic Speech Recognition  LibriSpeech other segmented (test)   WER 3.9     10
Automatic Speech Recognition  CommonVoice segmented (test)         WER 11.4    10
Automatic Speech Recognition  Earnings21 long-form (test)          WER 11.4    10
