COMET: A Neural Framework for MT Evaluation
About
We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgements. Our framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality. To showcase our framework, we train three models with different types of human judgements: Direct Assessments, Human-mediated Translation Edit Rate and Multidimensional Quality Metrics. Our models achieve new state-of-the-art performance on the WMT 2019 Metrics shared task and demonstrate robustness to high-performing systems.
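The abstract judges a metric by how well its scores correlate with human judgements. As a rough illustration of that meta-evaluation step, the sketch below computes a Pearson correlation and a simplified Kendall tau-like statistic between metric scores and human scores for the same segments. The function names and the toy scores are assumptions made for illustration; they are not taken from the paper or from the COMET codebase, and the tau-like statistic here simply ignores tied pairs.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def kendall_tau_like(xs, ys):
    """(concordant - discordant) / (concordant + discordant) over all pairs.

    Tied pairs are skipped, which is a simplification chosen for this sketch.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("all pairs are tied")
    return (concordant - discordant) / total


# Toy, made-up scores for four translated segments (illustration only).
metric_scores = [0.10, 0.40, 0.35, 0.80]  # hypothetical metric outputs
human_scores = [1, 3, 2, 4]               # hypothetical human ratings

r = pearson(metric_scores, human_scores)
tau = kendall_tau_like(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}, tau-like = {tau:.3f}")
```

In this toy case the metric ranks the four segments in the same order as the human ratings, so every pair is concordant. Real meta-evaluations run over many segments and language pairs, and shared tasks define their own exact variants of these statistics.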
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Machine Translation Meta-evaluation | WMT Metrics Shared Task Segment-level 2023 (Primary submissions) | Avg Correlation | 0.622 | 33 |
| Machine Translation Meta-evaluation | WMT MQM (En-De, En-Es, Ja-Zh) 24 | SPA | 82.4 | 28 |
| Machine Translation Evaluation | WMT MQM Segment-level 22 | Score (En-De) | 59.4 | 19 |
| Machine Translation Evaluation | WMT MQM System-level 22 | Overall Score | 83.9 | 19 |
| Quality Estimation | Portuguese (pt-BR) dialect sentences (test) | Success Rate | 54 | 11 |
| Quality Estimation | Mandarin (zh-CN) dialect sentences (test) | Success Rate | 53 | 11 |
| Machine Translation Evaluation | WMT 2019 (test) | de-en | 0.219 | 10 |
| Machine Translation | English-to-Czech | COMET-DA | 0.9 | 8 |
| Machine Translation | English-to-German | COMET-DA | 60.5 | 8 |
| Machine Translation | English-to-Russian | COMET-DA | 65.8 | 8 |