Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

About

Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, English-to-German, and English-to-Russian translation, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
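
The READ/WRITE idea in the abstract can be illustrated with a toy decoding loop. This is only a sketch under stated assumptions, not Hikari's implementation: the token names `<wait>` and `<eos>`, the functions `stream_decode` and `toy_model`, and the chunk-by-chunk audio interface are all hypothetical. In the real model the WAIT token would presumably be sampled from a probability distribution over the vocabulary. Here a deterministic stand-in returns the next token directly.

```python
from typing import Callable, List, Sequence

WAIT = "<wait>"  # hypothetical name for the WAIT token
EOS = "<eos>"    # hypothetical end-of-sequence token


def stream_decode(
    chunks: Sequence[str],
    next_token: Callable[[List[str], List[str]], str],
    max_steps: int = 1000,
) -> List[str]:
    """Interleave READ and WRITE actions, driven only by the emitted token."""
    heard: List[str] = []   # input chunks consumed so far
    output: List[str] = []  # target tokens emitted so far
    pos = 0
    for _ in range(max_steps):
        token = next_token(heard, output)
        if token == WAIT:
            if pos >= len(chunks):
                break  # assumption: stop if the model waits with no input left
            heard.append(chunks[pos])  # READ: consume one more input chunk
            pos += 1
        elif token == EOS:
            break
        else:
            output.append(token)  # WRITE: emit one target token
    return output


def toy_model(heard: List[str], output: List[str]) -> str:
    """Stand-in for a model: 'translates' each chunk by upper-casing it."""
    if len(output) < len(heard):
        return heard[len(output)].upper()
    return WAIT  # nothing new to say yet, so ask for more input


print(stream_decode(["a", "b", "c"], toy_model))  # ['A', 'B', 'C']
```

The point of the sketch is that no separate policy module decides when to read: the decision is just another token in the output stream, which is the sense in which such a model is policy-free.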

Roman Koshkin, Jeon Haesung, Lianbo Liu, Hao Shi, Mengjie Zhao, Yusuke Fujita, Yui Sudo • 2026

Related benchmarks

Task                          Dataset                Metric  Result  Rank
Automatic Speech Recognition  TED-LIUM               WER     8.6     18
Speech-to-text Translation    IWSLT25Instruct en-de  BLEU    37.67   10
Speech Recognition            Rev 16                 WER     19.9    9
Speech-to-text Translation    IWSLT25Instruct en-ja  BLEU    42.17   6
Automatic Speech Recognition  Meanwhile              WER     16.2    5
Automatic Speech Recognition  Earnings21             WER     54.1    5
Speech-to-text Translation    IWSLT25Instruct en-ru  BLEU    42.75   3
Speech-to-text Translation    HQ Podcasts en-ja      BLEU    49.42   3
Speech-to-text Translation    HQ Podcasts en-de      BLEU    43.7    3
Speech-to-text Translation    HQ Podcasts en-ru      BLEU    40.12   3