
Simultaneous Speech-to-Speech Translation Without Aligned Data

About

Simultaneous speech translation requires translating source speech into a target language in real time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale, so they depend on synthetic alignments produced by suboptimal language-specific heuristics. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000 hours of speech. We provide examples, model weights, and inference code, and we release a benchmark containing 45 hours of multilingual data for speech translation evaluation.
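The abstract names GRPO as the reinforcement learning step but gives no details of the reward. As a rough illustration only, the sketch below shows the group-relative advantage computation that GRPO is generally known for: several outputs are sampled for the same input, and each reward is standardized against the mean and standard deviation of its group. The reward function `toy_reward`, its `latency_weight` parameter, and the sample values are hypothetical and are not taken from the paper; they only stand in for some trade-off between translation quality and latency.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples (GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is used here as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reward(quality, latency_s, latency_weight=0.1):
    """Hypothetical reward: quality score minus a latency penalty.

    This is NOT the reward used by Hibiki-Zero; it only illustrates
    trading translation quality against delay.
    """
    return quality - latency_weight * latency_s


# Four hypothetical sampled translations of the same source utterance,
# given as (quality score in [0, 1], average lag in seconds).
samples = [(0.80, 4.0), (0.78, 2.5), (0.60, 1.5), (0.82, 5.0)]

rewards = [toy_reward(q, lag) for q, lag in samples]
advantages = group_relative_advantages(rewards)

for (q, lag), adv in zip(samples, advantages):
    print(f"quality={q:.2f} lag={lag:.1f}s advantage={adv:+.2f}")
```

Under this toy reward, the second sample (slightly lower quality but much lower lag) receives the largest advantage, which is the kind of pressure toward lower latency that the abstract describes. The actual reward design and training loop are specified in the paper, not here.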

Tom Labiausse, Romain Fabre, Yannick Estève, Alexandre Défossez, Neil Zeghidour • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speech-to-speech translation | Europarl-ST short-form (test) | BLEU | 35 | 9 |
| Speech-to-speech translation | Audio-NTREX-4L long-form (test) | BLEU | 33.2 | 9 |
| Speech-to-speech translation | Italian-to-English short-form Evaluation Data (test) | BLEU | 32.1 | 4 |
| Simultaneous Speech-to-Speech Translation | Human Evaluation Set French short-form | Audio Quality (MOS) | 64.5 | 3 |
| Simultaneous Speech-to-Speech Translation | Human Evaluation Set Spanish short-form | Audio Quality (MOS) | 66.8 | 2 |
| Simultaneous Speech-to-Speech Translation | Human Evaluation Set Portuguese short-form | Audio Quality | 62 | 2 |
| Simultaneous Speech-to-Speech Translation | Human Evaluation Set German short-form | Audio Quality | 73.5 | 2 |