Streaming Translation and Transcription Through Speech-to-Text Causal Alignment
About
Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, English-to-German, and English-to-Russian translation, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
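To make the WAIT-token idea concrete, below is a minimal sketch of a greedy READ/WRITE decode loop: at each step the model predicts either a reserved WAIT token (a READ action, consuming one more audio chunk) or an ordinary text token (a WRITE action). All names here (`simultaneous_decode`, `toy_step`, the token ids) are illustrative assumptions, not Hikari's actual API; the toy policy simply stands in for the trained network.

```python
def simultaneous_decode(audio_chunks, step_fn, wait_id=0, eos_id=1):
    """Greedy READ/WRITE loop driven by a WAIT token.

    step_fn(n_chunks_read, tokens) -> next token id. Predicting wait_id
    is a READ (consume one more audio chunk); any other id is a WRITE
    (append to the output), until eos_id ends decoding.
    """
    read, tokens = 0, []
    while True:
        tok = step_fn(read, tokens)
        if tok == wait_id:          # READ: ingest one more audio chunk
            assert read < len(audio_chunks), "must not WAIT past end of audio"
            read += 1
        elif tok == eos_id:         # end of output
            return tokens
        else:                       # WRITE: emit a text token
            tokens.append(tok)

def toy_step(read, tokens, total=6):
    # Toy stand-in for the model: WAIT until roughly two audio chunks
    # have been read per output token, then WRITE; stop after 3 tokens.
    if read < min(total, 2 * (len(tokens) + 1)):
        return 0                    # WAIT -> READ
    if len(tokens) >= 3:
        return 1                    # EOS
    return 100 + len(tokens)        # dummy text-token ids

print(simultaneous_decode(list(range(6)), toy_step))  # [100, 101, 102]
```

In the real model these decisions come from the token distribution itself rather than a hand-written policy, which is what makes the approach policy-free; the loop above only shows how a single reserved token suffices to interleave reading and writing.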
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | TED-LIUM | WER 8.6 | 18 |
| Speech-to-text Translation | IWSLT25Instruct en-de | BLEU 37.67 | 10 |
| Speech Recognition | Rev 16 | WER 19.9 | 9 |
| Speech-to-text Translation | IWSLT25Instruct en-ja | BLEU 42.17 | 6 |
| Automatic Speech Recognition | Meanwhile | WER 16.2 | 5 |
| Automatic Speech Recognition | Earnings21 | WER 54.1 | 5 |
| Speech-to-text Translation | IWSLT25Instruct en-ru | BLEU 42.75 | 3 |
| Speech-to-text Translation | HQ Podcasts en-ja | BLEU 49.42 | 3 |
| Speech-to-text Translation | HQ Podcasts en-de | BLEU 43.7 | 3 |
| Speech-to-text Translation | HQ Podcasts en-ru | BLEU 40.12 | 3 |