
Large Language Models Are State-of-the-Art Evaluators of Translation Quality

About

We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate nine versions of GPT models, including ChatGPT and GPT-4. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Compared to results from WMT22's Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.
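
To make the two modes concrete, here is a minimal sketch of how a zero-shot scoring prompt might be assembled with or without a reference, and how a numeric score might be read back from the model's reply. The prompt wording, the 0 to 100 scale, and the names `build_prompt` and `parse_score` are assumptions made for illustration. They are not the prompt templates released by the authors, and no model call is shown.

```python
import re
from typing import Optional


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                 reference: Optional[str] = None) -> str:
    """Build a zero-shot scoring prompt.

    Hypothetical wording for illustration only; not the released template.
    Passing `reference` switches from reference-free to reference-based mode.
    """
    with_ref = reference is not None
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang} "
        + ("with respect to the human reference " if with_ref else "")
        + "on a continuous scale from 0 to 100, where 0 means no meaning "
        "preserved and 100 means perfect meaning and grammar.",
        "",
        f'{src_lang} source: "{source}"',
    ]
    if with_ref:
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{hypothesis}"')
    lines.append("Score:")
    return "\n".join(lines)


def parse_score(reply: str) -> Optional[float]:
    """Return the first number in the reply if it lies in [0, 100], else None."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 100.0 else None
```

Calling `build_prompt("Guten Morgen", "Good morning", "German", "English")` gives the reference-free variant; adding `reference="Good morning"` gives the reference-based one. Replies with no usable number are returned as `None` so that a caller can decide how to handle them.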

Tom Kocmi, Christian Federmann • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Machine Translation Meta-evaluation | MENT ZH-EN | Meta Score: 77.2 | 30 |
| Machine Translation Meta-evaluation | MENT EN-ZH | Meta Score: 77.2 | 30 |
| Machine Translation Meta-evaluation | WMT MQM (En-De, En-Es, Ja-Zh) 24 | SPA: 84.6 | 28 |
| Machine Translation Evaluation | WMT MQM System-level 22 | Overall Score: 86.9 | 19 |
| Machine Translation Evaluation | WMT MQM Segment-level 22 | Score (En-De): 55.2 | 19 |
| Machine Translation Evaluation | WMT MQM 2022 (test) | Accuracy (System, 3 LPs): 89.8 | 16 |
| Machine Translation Evaluation | MSLC OOD 24 | MT Empty Score: 14 | 12 |
| Personalized Text Generation | LongLaMP | Alignment Score: 69 | 7 |
| Machine Translation Meta-evaluation | WMT Zh-En (subset of 600 samples) 2022 | Kendall Correlation: 0.4492 | 2 |
