
Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

About

We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence approaches rely on learning a policy for choosing when to advance on the input stream or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, and is even competitive with offline baselines. Code, samples, and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez • 2025
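
As a rough illustration of the delay idea described in the abstract, the toy sketch below shifts one of two time-aligned streams by a fixed number of steps. It assumes both streams already share a frame grid and are plain Python lists. The function name `apply_delay`, the `PAD` marker, and the example values are hypothetical and are not taken from the authors' code, which may represent streams and padding quite differently.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical padding marker for steps where a stream carries no content.
PAD: Optional[str] = None

Step = Tuple[Optional[str], Optional[str]]


def apply_delay(
    input_stream: Sequence[Optional[str]],
    output_stream: Sequence[Optional[str]],
    delay: int,
) -> List[Step]:
    """Pair two time-aligned streams, shifting the output stream right by `delay` steps.

    At step t the pair holds input[t] and output[t - delay], so whatever
    consumes the pairs has seen `delay` extra input steps before it reaches
    the output aligned with a given time.
    """
    if len(input_stream) != len(output_stream):
        raise ValueError("streams must be time-aligned (same length)")
    if delay < 0:
        raise ValueError("delay must be non-negative")
    padded_in = list(input_stream) + [PAD] * delay      # input runs out first
    delayed_out = [PAD] * delay + list(output_stream)   # output starts late
    return list(zip(padded_in, delayed_out))


# Toy aligned streams: four audio frames and the words aligned to them.
audio = ["a0", "a1", "a2", "a3"]
text = ["hi", PAD, "there", PAD]

# ASR-like setup: audio is the input, the text stream is delayed.
asr_steps = apply_delay(audio, text, delay=2)

# TTS-like setup: text is the input, the audio stream is delayed.
tts_steps = apply_delay(text, audio, delay=2)
```

In this sketch the only difference between the ASR-like and TTS-like setups is which stream is delayed, which mirrors the abstract's description. The choice of delay trades latency against how much of the input is visible before each output step.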

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Speech | Seed-TTS en (test) | WER 1.34 | 90 |
| Text-to-Speech | LibriSpeech PC clean (test) | WER 1.63 | 31 |
| Text-to-Speech | EmergentTTS (eval) | Overall WER 9.1 | 25 |
| Automatic Speech Recognition | TED-LIUM | WER 2.9 | 18 |
| Speech Recognition | Rev 16 | WER 12.3 | 9 |
| Text-to-Speech | Emilia EN speaking-rate | MUSHRA Score 60.5 | 9 |
| Automatic Speech Recognition | Earnings21 | WER 10.6 | 5 |
| Automatic Speech Recognition | Meanwhile | WER 5.7 | 5 |