Scaling up COMETKIWI: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task
About
We present the joint contribution of Unbabel and Instituto Superior Técnico to the WMT 2023 Shared Task on Quality Estimation (QE). Our team participated in all tasks: sentence- and word-level quality prediction (task 1) and fine-grained error span detection (task 2). For all tasks, we build on the COMETKIWI-22 model (Rei et al., 2022b). Our multilingual approaches rank first in all tasks, reaching state-of-the-art performance for quality estimation at word-, span-, and sentence-level granularity. Compared to the previous state of the art, COMETKIWI-22, we show large improvements in correlation with human judgements (up to 10 Spearman points). Moreover, we surpass the second-best multilingual submission to the shared task by up to 3.8 absolute points.
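Sentence-level QE systems are typically evaluated by Spearman correlation between the model's predicted quality scores and human judgements, which is rank-based rather than value-based. A minimal sketch of that metric (pure Python, with hypothetical score lists standing in for model predictions and human annotations):

```python
def rank(values):
    """Return 1-based average ranks, handling ties by averaging."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Group indices whose values are tied.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: human quality judgements vs. model predictions.
human = [0.2, 0.9, 0.5, 0.7]
model = [0.1, 0.8, 0.4, 0.9]
print(spearman(human, model))  # 0.8: ordering agrees except for two swapped items
```

Because only the relative ordering matters, a model that systematically over- or under-estimates quality can still achieve a perfect Spearman score as long as it ranks translations correctly.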
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Machine Translation Meta-evaluation | WMT24 MQM (En-De, En-Es, Ja-Zh) | SPA 85.4 | 28 |
| Machine Translation Ranking | NT20 En→Zh | Accuracy 66.49 | 11 |
| Machine Translation Ranking | GenMT22 MQM En→De | Accuracy 61.2 | 11 |
| Machine Translation Ranking | GenMT22 MQM En→Ru | Accuracy 67.12 | 11 |
| Machine Translation Ranking | NT20 Zh→En | Accuracy 57.82 | 11 |
| Machine Translation Ranking | GenMT22 MQM Zh→En | Accuracy 61.6 | 11 |
| Machine Translation Ranking | Seed-X-Challenge Zh↔En | Accuracy 46.72 | 11 |
| Machine Translation Ranking | Gemini-annotated held-out Zh↔En (test) | Accuracy 72.01 | 10 |
| Quality Estimation | En-Ml | Pearson r 0.454 | 9 |