Span-level Machine Translation Error Detection on WMT MQM EN-DE 2023 (test)
[Chart: Precision by date; leading entry MQM #2 at 39.02 on Mar 20, 2026. Selectable metrics: Precision, Recall, F1 Score.]
Updated 27d ago
Evaluation Results
Method        Details               Links    Precision  Recall  F1 Score
MQM #2        type=human evaluator  2026.03  39.02      37.47   38.23
Sonnet 4.5    version=4.5           2026.03  38.92      27.8    32.44
MQM #1        type=human evaluator  2026.03  38.04      35.51   36.73
Haiku 4.5     version=4.5           2026.03  37.79      18.54   24.87
gpt-oss 120b  parameters=120b       2026.03  32.33      24.6    27.94
Qwen3 235b    parameters=235b       2026.03  31.18      30.41   30.79
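
The F1 Score column can be checked against the Precision and Recall columns. The sketch below assumes that F1 here is the standard harmonic mean of precision and recall; the page does not state how the scores are aggregated (for example per segment or over the whole test set), so this is an illustration and not the benchmark's own scoring code. The names `f1_score`, `RESULTS`, and `max_discrepancy` are hypothetical helpers introduced for this example.

```python
# Sketch: recompute F1 as the harmonic mean of precision and recall.
# Assumption: the leaderboard uses the standard F1 definition; its exact
# aggregation method is not stated on the page.


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (same scale as inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the results table.
RESULTS = {
    "MQM #2": (39.02, 37.47, 38.23),
    "Sonnet 4.5": (38.92, 27.8, 32.44),
    "MQM #1": (38.04, 35.51, 36.73),
    "Haiku 4.5": (37.79, 18.54, 24.87),
    "gpt-oss 120b": (32.33, 24.6, 27.94),
    "Qwen3 235b": (31.18, 30.41, 30.79),
}


def max_discrepancy(results: dict) -> float:
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in results.values())


if __name__ == "__main__":
    for name, (p, r, reported) in RESULTS.items():
        print(f"{name:14s} recomputed={f1_score(p, r):.2f} reported={reported:.2f}")
```

Under this assumption the recomputed values agree with the reported ones to within about 0.01. The small gaps are plausibly due to the published precision and recall being rounded to two decimals.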