
Learning When to Translate for Streaming Speech

About

How can we find the proper moments to generate partial-sentence translations from a streaming speech input? Existing approaches that wait and then translate for a fixed duration often break the acoustic units in speech, since the boundaries between acoustic units are not evenly spaced. In this paper, we propose MoSST, a simple yet effective method for translating streaming speech content. Given a typically long speech sequence, we develop an efficient monotonic segmentation module inside an encoder-decoder model that accumulates acoustic information incrementally and detects proper speech-unit boundaries in the input for the speech translation task. Experiments on multiple translation directions of the MuST-C dataset show that MoSST outperforms existing methods and achieves the best trade-off between translation quality (BLEU) and latency. Our code is available at https://github.com/dqqcasia/mosst.
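The core idea of accumulating acoustic information until a unit boundary is detected can be sketched as an accumulate-and-fire rule. The following is a minimal, hypothetical illustration (not the authors' implementation): each frame carries an "information weight", and a boundary fires once the accumulated weight crosses a threshold, so translation can start on complete acoustic units rather than at fixed time intervals.

```python
# Hypothetical accumulate-and-fire boundary detection, in the spirit of
# monotonic segmentation for streaming speech. The per-frame weights would
# come from a learned module in a real model; here they are given directly.

def detect_boundaries(frame_weights, threshold=1.0):
    """Return indices of frames where a speech-unit boundary fires."""
    boundaries = []
    acc = 0.0
    for i, w in enumerate(frame_weights):
        acc += w
        if acc >= threshold:
            boundaries.append(i)
            acc -= threshold  # carry the remainder into the next unit

    return boundaries

# Example: weights spike toward the end of each spoken unit, so boundaries
# fire at varying intervals instead of every fixed number of frames.
weights = [0.1, 0.2, 0.1, 0.7, 0.2, 0.3, 0.6, 0.1, 0.9]
print(detect_boundaries(weights))  # -> [3, 6, 8]
```

In a streaming setting, the decoder would emit a partial translation after each fired boundary instead of waiting a fixed duration.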

Qianqian Dong, Yaoming Zhu, Mingxuan Wang, Lei Li • 2021

Related benchmarks

Task                             Dataset                      Result     Rank
Speech Translation               MuST-C EN-DE (tst-COMMON)    BLEU 24.9  41
Simultaneous Speech Translation  MuST-C EN-DE (tst-COMMON)    BLEU 20    39
Speech Translation               MuST-C EN-FR (tst-COMMON)    BLEU 35.3  17
