Scaling up COMETKIWI: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task
About
We present the joint contribution of Unbabel and Instituto Superior Técnico to the WMT 2023 Shared Task on Quality Estimation (QE). Our team participated in all tasks: sentence- and word-level quality prediction (task 1) and fine-grained error span detection (task 2). For all tasks, we build on the COMETKIWI-22 model (Rei et al., 2022b). Our multilingual approaches rank first in all tasks, reaching state-of-the-art performance for quality estimation at word-, span-, and sentence-level granularity. Compared to the previous state of the art, COMETKIWI-22, we show large improvements in correlation with human judgements (up to 10 Spearman points). Moreover, we surpass the second-best multilingual submission to the shared task by up to 3.8 absolute points.
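Sentence-level QE systems are typically evaluated by Spearman correlation between the model's predicted quality scores and human judgements, which is rank-based rather than value-based. A minimal sketch of that metric (pure Python, with hypothetical score lists standing in for model predictions and human annotations):

```python
def rank(values):
    """Return 1-based average ranks, handling ties by averaging."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Group indices whose values are tied.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: human quality judgements vs. model predictions.
human = [0.2, 0.9, 0.5, 0.7]
model = [0.1, 0.8, 0.4, 0.9]
print(spearman(human, model))  # 0.8: ordering agrees except for two swapped items
```

Because only the relative ordering matters, a model that systematically over- or under-estimates quality can still achieve a perfect Spearman score as long as it ranks translations correctly.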
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Machine Translation Meta-evaluation | WMT24 MQM (En-De, En-Es, Ja-Zh) | SPA 85.4 | 28 |
| Machine Translation Ranking | NT20 En→Zh | Accuracy 66.49 | 11 |
| Machine Translation Ranking | GenMT22 MQM En→De | Accuracy 61.2 | 11 |
| Machine Translation Ranking | GenMT22 MQM En→Ru | Accuracy 67.12 | 11 |
| Machine Translation Ranking | NT20 Zh→En | Accuracy 57.82 | 11 |
| Machine Translation Ranking | GenMT22 MQM Zh→En | Accuracy 61.6 | 11 |
| Machine Translation Ranking | Seed-X-Challenge Zh↔En | Accuracy 46.72 | 11 |
| Machine Translation Ranking | Gemini-annotated held-out Zh↔En (test) | Accuracy 72.01 | 10 |
| Quality Estimation | En-Ml | Pearson r 0.454 | 9 |