A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

About

Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose the Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though with a trade-off in ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community. Our code is publicly available at: https://github.com/idiap/OTTC
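
As background for the one-dimensional optimal transport mentioned in the abstract, the sketch below illustrates the classical fact that, for two discrete distributions on the real line with sorted supports and a cost that is a convex function of distance, the monotone ("north-west corner") coupling is optimal. It is not the SOTD or OTTC formulation from the paper; the function names `monotone_coupling` and `transport_cost`, the tolerance, and the toy inputs are all hypothetical and chosen only for illustration.

```python
def monotone_coupling(a, b, tol=1e-12):
    """Monotone (north-west corner) coupling of two mass vectors.

    `a` and `b` are lists of non-negative masses with equal totals, attached
    to support points that are assumed to be sorted in increasing order.
    Returns a list of (i, j, mass) triples describing the transport plan.
    """
    if abs(sum(a) - sum(b)) > 1e-9:
        raise ValueError("a and b must carry the same total mass")
    plan = []
    i = j = 0
    rem_a, rem_b = a[0], b[0]  # mass still to ship from i / to receive at j
    while i < len(a) and j < len(b):
        moved = min(rem_a, rem_b)
        if moved > tol:
            plan.append((i, j, moved))
        rem_a -= moved
        rem_b -= moved
        if rem_a <= tol:  # source i is exhausted, move to the next source
            i += 1
            rem_a = a[i] if i < len(a) else 0.0
        if rem_b <= tol:  # target j is full, move to the next target
            j += 1
            rem_b = b[j] if j < len(b) else 0.0
    return plan


def transport_cost(plan, x, y):
    """Total cost of a plan under the absolute-difference ground cost."""
    return sum(mass * abs(x[i] - y[j]) for i, j, mass in plan)


# Toy example: two source points, three target points (hypothetical data).
x, a = [0.0, 1.0], [0.5, 0.5]
y, b = [0.0, 0.5, 1.0], [0.25, 0.25, 0.5]
plan = monotone_coupling(a, b)
cost = transport_cost(plan, x, y)
```

Because the plan only ever advances `i` and `j`, it never crosses itself, which is the property that makes a one-dimensional coupling readable as a monotone alignment between two sequences.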

Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi • 2025

Related benchmarks

Task	Dataset	Result
Mispronunciation Detection	L2-ARCTIC (test)	F1 Score: 63.18	20
Mispronunciation Diagnosis	L2-ARCTIC (test)	EDR: 22.12	14
Phoneme Recognition	L2-ARCTIC (test)	Phoneme Error Rate (PER): 18.07	14
