MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task

About

In this paper, we present the MetricX-24 submissions to the WMT24 Metrics Shared Task and provide details on the improvements we made over the previous version of MetricX. Our primary submission is a hybrid reference-based/-free metric, which can score a translation irrespective of whether it is given the source segment, the reference, or both. The metric is trained on previous WMT data in a two-stage fashion, first on the DA ratings only, then on a mixture of MQM and DA ratings. The training set in both stages is augmented with synthetic examples that we created to make the metric more robust to several common failure modes, such as fluent but unrelated translation, or undertranslation. We demonstrate the benefits of the individual modifications via an ablation study, and show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.
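The hybrid behaviour described above (scoring with the source, the reference, or both) can be pictured as a single input-building step that includes whichever fields are available. The following is a minimal sketch under stated assumptions: the function names, the `source:` / `candidate:` / `reference:` labels, and their ordering are illustrative and are not confirmed by this page as the model's actual input format. The second helper shows one hypothetical way to build a synthetic undertranslation example by truncating a reference; the paper's actual procedure may differ.

```python
from typing import Optional


def build_metric_input(
    candidate: str,
    source: Optional[str] = None,
    reference: Optional[str] = None,
) -> str:
    """Assemble one text input from whichever fields are available.

    Hypothetical format: the field labels and their order are assumptions
    made for illustration only.
    """
    if source is None and reference is None:
        raise ValueError("Provide the source, the reference, or both.")
    parts = []
    if source is not None:
        parts.append(f"source: {source}")
    parts.append(f"candidate: {candidate}")
    if reference is not None:
        parts.append(f"reference: {reference}")
    return " ".join(parts)


def make_undertranslation(reference: str, keep_fraction: float = 0.5) -> str:
    """Simulate an undertranslation by keeping only a prefix of the reference.

    Hypothetical helper: keeps the first `keep_fraction` of the
    whitespace-separated tokens (at least one token).
    """
    tokens = reference.split()
    keep = max(1, int(len(tokens) * keep_fraction))
    return " ".join(tokens[:keep])


# Reference-free (QE-style) input: only the source is given.
qe_input = build_metric_input("Hallo Welt", source="Hello world")

# Input using both the source and the reference.
full_input = build_metric_input(
    "Hallo Welt", source="Hello world", reference="Hallo, Welt"
)

# A synthetic "bad" candidate that drops the second half of the reference.
truncated = make_undertranslation("Der Hund schläft im Garten")
```

Keeping the three modes in one function mirrors the idea of a single metric that does not need separate reference-based and reference-free variants; a trained model would then consume the resulting string.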

Juraj Juraska, Daniel Deutsch, Mara Finkelstein, Markus Freitag • 2024

Related benchmarks

Task	Dataset	Result
Speech Translation Evaluation	Must-C	Pearson Correlation 0.9615	94
Speech Translation Metric Evaluation	Europarl-ST (test)	Average Correlation 0.915	84
Translation Evaluation	Met-BOUQuET XSTS+R+P (test)	Spearman's rho 0.505	38
Machine Translation Meta-evaluation	WMT Metrics Shared Task Segment-level 2023 (Primary submissions)	Avg Correlation 0.682	33
Machine Translation Meta-evaluation	MENT ZH-EN	Meta Score 56.2	30
Machine Translation Meta-evaluation	MENT EN-ZH	Meta Score 56.2	30
Machine Translation Meta-evaluation	WMT MQM (En-De, En-Es, Ja-Zh) 24	SPA 85.6	28
Machine Translation Evaluation Metric	WMT MQM 23	Acc 90.7	27
Machine Translation Evaluation	WMT MQM Segment-level 22	Score (En-De) 60.1	19
Machine Translation Evaluation	WMT MQM System-level 22	Overall Score 85	19

Showing 10 of 21 rows
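
Several of the results above are correlations between metric scores and human ratings. As a rough illustration of what such a number measures, here is a minimal sketch of the Pearson correlation in plain Python. The scores and ratings in the example are made-up toy values, not data from any benchmark listed here, and the function name is an assumption for illustration.

```python
import math
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two sequences of equal length (at least 2).")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for constant input.")
    return cov / math.sqrt(var_x * var_y)


# Toy values only: hypothetical metric scores and human ratings.
toy_metric_scores = [1.0, 2.0, 3.0]
toy_human_ratings = [2.0, 4.0, 6.0]
r = pearson_correlation(toy_metric_scores, toy_human_ratings)
```

A value near 1 means the metric ranks and spaces the translations much as the human raters do; shared-task meta-evaluations typically combine several such statistics across language pairs.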
