COMET: A Neural Framework for MT Evaluation

About

We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgements. Our framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality. To showcase our framework, we train three models with different types of human judgements: Direct Assessments, Human-mediated Translation Edit Rate and Multidimensional Quality Metrics. Our models achieve new state-of-the-art performance on the WMT 2019 Metrics shared task and demonstrate robustness to high-performing systems.

Ricardo Rei, Craig Stewart, Ana C Farinha, Alon Lavie • 2020
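The "correlation with human judgements" that the abstract refers to can be illustrated with a small calculation. The following is a minimal sketch, not the authors' evaluation code: the function name `pearson` and the toy score lists are assumptions made up for illustration, and shared tasks may report other statistics (for example Kendall's Tau variants) instead of Pearson's r.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one metric score and one human score per segment.
metric_scores = [0.82, 0.45, 0.91, 0.30, 0.67]
human_scores = [78.0, 52.0, 88.0, 35.0, 70.0]

r = pearson(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the metric ranks and spaces the segments much as the human annotators did; a value near 0 means the metric carries little information about human-perceived quality.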

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Speech Translation Evaluation	Must-C	Pearson Correlation	0.9896	94
Speech Translation Metric Evaluation	Europarl-ST (test)	Average Correlation	0.9857	84
Machine Translation Meta-evaluation	WMT Metrics Shared Task Segment-level 2023 (Primary submissions)	Avg Correlation	0.622	33
Machine Translation Meta-evaluation	WMT MQM (En-De, En-Es, Ja-Zh) 24	SPA	82.4	28
Machine Translation Evaluation	WMT 2019 (test)	de-en	0.219	25
Machine Translation Evaluation	WMT MQM Segment-level 22	Score (En-De)	59.4	19
Machine Translation Evaluation	WMT MQM System-level 22	Overall Score	83.9	19
Machine Translation Meta-evaluation	WMT EN-UK 2025	Acc*Eq	0.572	17
Machine Translation Meta-evaluation	WMT EN-CS 2025	Acc*Eq	59.5	17
Machine Translation Meta-evaluation	WMT EN-ZH 2025	Acc*Eq	54.4	17

Showing 10 of 34 rows
