Segmental Attention Decoding With Long Form Acoustic Encodings
About
We address a fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting the limited acoustic context beyond segment boundaries, but they fail to generalize when decoding segments of a long-form encoding, where these cues vanish. The model loses the ability to order the acoustic encodings because cross-attention is permutation invariant in its keys and values. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate the implicit encoding of absolute positions, (3) segment concatenation to cover the diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with the training segments. We show that these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling autoregressive use of the attention decoder.
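To make modification (1) concrete, here is a minimal PyTorch sketch of cross-attention over one decoded segment of a long-form acoustic encoding, with sinusoidal absolute positional encodings added to the keys so that the keys/values are no longer permutation invariant. The function names, the choice of sinusoidal encodings, and the assumption that positions restart at each segment boundary are illustrative assumptions, not the paper's exact recipe.

```python
import math
import torch

def sinusoidal_pe(num_positions: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal absolute positional encodings, shape (num_positions, d_model).
    Assumes an even d_model."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def segmental_cross_attention(queries, enc, seg_start, seg_end, attn):
    """Cross-attend to the acoustic encodings of the segment currently being
    decoded, injecting explicit absolute positions into the keys.

    queries:   (batch, tgt_len, d_model) decoder states
    enc:       (batch, n_frames, d_model) long-form acoustic encodings
    seg_start: first frame of the decoded segment (inclusive)
    seg_end:   last frame of the decoded segment (exclusive)
    attn:      torch.nn.MultiheadAttention constructed with batch_first=True
    """
    seg = enc[:, seg_start:seg_end]                        # restrict to this segment
    pe = sinusoidal_pe(seg_end - seg_start, enc.size(-1))  # positions local to the segment
    keys = seg + pe.to(seg.device).unsqueeze(0)            # order information for the keys
    out, _ = attn(queries, keys, seg)                      # values left unmodified
    return out
```

Adding the positions only to the keys (leaving the values untouched) is one common design choice; it restores ordering information to the attention scores without perturbing the content that is aggregated.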
Related benchmarks
| Task | Dataset | WER (%) | Rank |
|---|---|---|---|
| Automatic Speech Recognition | TED-LIUM3 (test) | 3.9 | 55 |
| Automatic Speech Recognition | LibriSpeech clean segmented (test) | 1.7 | 10 |
| Automatic Speech Recognition | LibriSpeech other segmented (test) | 3.9 | 10 |
| Automatic Speech Recognition | CommonVoice segmented (test) | 11.4 | 10 |
| Automatic Speech Recognition | Earnings21 long-form (test) | 11.4 | 10 |