# Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

## About
Simultaneous speech-to-text translation systems must balance translation quality with latency. Although quality evaluation is well established, latency measurement remains a challenge. Existing metrics produce inconsistent results, especially in short-form settings with artificial presegmentation. We present the first comprehensive meta-evaluation of latency metrics across language pairs and systems. We uncover a structural bias in current metrics related to segmentation. We introduce YAAL (Yet Another Average Lagging) for a more accurate short-form evaluation and LongYAAL for unsegmented audio. We propose SoftSegmenter, a resegmentation tool based on soft word-level alignment. We show that YAAL and LongYAAL, together with SoftSegmenter, outperform popular latency metrics, enabling more reliable assessments of short- and long-form simultaneous speech translation systems. We implement all artifacts within the OmniSTEval toolkit: https://github.com/pe-trik/OmniSTEval.
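For context, the latency metrics studied here build on Average Lagging (AL; Ma et al., 2019), which compares each target token's delay against an ideal, perfectly synchronous translator. Below is a minimal sketch of the time-based AL computation in the style popularized by SimulEval; the function name and signature are illustrative and not part of the OmniSTEval API.

```python
def average_lagging(delays, src_len, tgt_len=None):
    """Classic Average Lagging (Ma et al., 2019), time-based variant.

    delays[t] -- source context consumed (ms of audio, or #tokens)
                 when target token t was emitted; non-decreasing.
    src_len   -- total source length, in the same unit as `delays`.
    tgt_len   -- reference target length; defaults to len(delays).
    """
    if tgt_len is None:
        tgt_len = len(delays)
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: 1-based index of the first token emitted only after the
    # entire source was consumed; later tokens add no new lag signal.
    tau = next((i + 1 for i, d in enumerate(delays) if d >= src_len),
               len(delays))
    # Lag of token t (0-based) = actual delay minus the delay t/gamma
    # that an oracle translating in perfect sync would incur.
    lags = [delays[t] - t / gamma for t in range(tau)]
    return sum(lags) / tau


# Hypothetical example: four target tokens, 4 seconds of source audio.
print(average_lagging([800, 1200, 2500, 4100], src_len=4000))  # 650.0 ms
```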
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Latency Metric Evaluation | IWSLT tst-COMMON 2022/2023 (w/o degenerate simultaneous policy) | -- | 2 |
| Latency Metric Evaluation | IWSLT En-De tst-COMMON 2022/2023 (w/o degenerate simultaneous policy) | -- | 2 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (all language pairs) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (En-De) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (En-Zh) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (En-Ja) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (Cs-En) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (different team) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (same team) | -- | 1 |
| Latency Metric Evaluation | IWSLT tst-COMMON 2022/2023 (all system pairs) | -- | 1 |