# Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

## About
Simultaneous speech-to-text translation systems must balance translation quality with latency. Although quality evaluation is well established, latency measurement remains a challenge. Existing metrics produce inconsistent results, especially in short-form settings with artificial presegmentation. We present the first comprehensive meta-evaluation of latency metrics across language pairs and systems. We uncover a structural bias in current metrics related to segmentation. We introduce YAAL (Yet Another Average Lagging) for a more accurate short-form evaluation and LongYAAL for unsegmented audio. We propose SoftSegmenter, a resegmentation tool based on soft word-level alignment. We show that YAAL and LongYAAL, together with SoftSegmenter, outperform popular latency metrics, enabling more reliable assessments of short- and long-form simultaneous speech translation systems. We implement all artifacts within the OmniSTEval toolkit: https://github.com/pe-trik/OmniSTEval.
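For context, the latency metrics studied here build on Average Lagging (AL; Ma et al., 2019), which compares each target token's delay against an ideal, perfectly synchronous translator. Below is a minimal sketch of the time-based AL computation in the style popularized by SimulEval; the function name and signature are illustrative and not part of the OmniSTEval API.

```python
def average_lagging(delays, src_len, tgt_len=None):
    """Classic Average Lagging (Ma et al., 2019), time-based variant.

    delays[t] -- source context consumed (ms of audio, or #tokens)
                 when target token t was emitted; non-decreasing.
    src_len   -- total source length, in the same unit as `delays`.
    tgt_len   -- reference target length; defaults to len(delays).
    """
    if tgt_len is None:
        tgt_len = len(delays)
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: 1-based index of the first token emitted only after the
    # entire source was consumed; later tokens add no new lag signal.
    tau = next((i + 1 for i, d in enumerate(delays) if d >= src_len),
               len(delays))
    # Lag of token t (0-based) = actual delay minus the delay t/gamma
    # that an oracle translating in perfect sync would incur.
    lags = [delays[t] - t / gamma for t in range(tau)]
    return sum(lags) / tau


# Hypothetical example: four target tokens, 4 seconds of source audio.
print(average_lagging([800, 1200, 2500, 4100], src_len=4000))  # 650.0 ms
```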
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Latency Metric Evaluation | IWSLT tst-COMMON 2022/2023 (w/o degenerate simultaneous policy) | -- | 2 |
| Latency Metric Evaluation | IWSLT En-De tst-COMMON 2022/2023 (w/o degenerate simultaneous policy) | -- | 2 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (all language pairs) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (En-De) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (En-Zh) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (En-Ja) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (Cs-En) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (different team) | -- | 1 |
| Latency Metric Accuracy Evaluation | Long-form SimulST (same team) | -- | 1 |
| Latency Metric Evaluation | IWSLT tst-COMMON 2022/2023 (all system pairs) | -- | 1 |