
Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

About

Simultaneous speech-to-text translation systems must balance translation quality with latency. Although quality evaluation is well established, latency measurement remains a challenge. Existing metrics produce inconsistent results, especially in short-form settings with artificial presegmentation. We present the first comprehensive meta-evaluation of latency metrics across language pairs and systems. We uncover a structural bias in current metrics related to segmentation. We introduce YAAL (Yet Another Average Lagging) for a more accurate short-form evaluation and LongYAAL for unsegmented audio. We propose SoftSegmenter, a resegmentation tool based on soft word-level alignment. We show that YAAL and LongYAAL, together with SoftSegmenter, outperform popular latency metrics, enabling more reliable assessments of short- and long-form simultaneous speech translation systems. We implement all artifacts within the OmniSTEval toolkit: https://github.com/pe-trik/OmniSTEval.
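To make the object of study concrete: the metrics meta-evaluated here are variants of Average Lagging (AL), which compares each target token's emission delay against an ideal "wait-free" schedule. The sketch below implements the standard time-based AL used in simultaneous speech translation, not YAAL itself (whose exact definition is given in the paper); the function name and signature are illustrative.

```python
def average_lagging(delays_ms, src_duration_ms, tgt_len):
    """Time-based Average Lagging for simultaneous speech translation.

    delays_ms[i]    -- milliseconds of source audio consumed when the
                       i-th target token was emitted (0-indexed).
    src_duration_ms -- total duration of the source audio.
    tgt_len         -- number of tokens in the (reference) translation,
                       used to set the oracle emission rate.
    """
    # Oracle schedule: token i is ideally emitted after i * T / |Y| ms.
    oracle_rate = src_duration_ms / tgt_len

    # Sum lags only up to the first token emitted after the full source
    # was consumed; later tokens add no information about lagging.
    total, tau = 0.0, 0
    for i, d in enumerate(delays_ms):
        total += d - i * oracle_rate
        tau += 1
        if d >= src_duration_ms:
            break
    return total / tau


# A system that stays exactly one second behind an evenly paced oracle:
al = average_lagging([1000, 2000, 3000], src_duration_ms=3000, tgt_len=3)
print(al)  # 1000.0 ms of average lag
```

The segmentation bias the paper uncovers enters through quantities like `src_duration_ms` and `tgt_len`: when audio is artificially presegmented, the per-segment oracle schedule changes, which is what YAAL and LongYAAL are designed to correct for.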

Peter Polák, Sara Papi, Luisa Bentivogli, Ondřej Bojar • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Latency Metric Evaluation | IWSLT tst-COMMON (w/o degenerate simultaneous policy) 2022/2023 | -- | 2 |
| Latency Metric Evaluation | IWSLT En-De tst-COMMON w/o degenerate 2022/2023 | -- | 2 |
| Latency Metric Accuracy Evaluation | Long-form SimulST, all language pairs | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST, En-De | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST, En-Zh | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST, En-Ja | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST, Cs-En | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST, different team | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST, same team | -- | 1 |
| Latency Metric Evaluation | IWSLT tst-COMMON, all system pairs 2022/2023 (All) | -- | 1 |
Showing 10 of 12 rows
