Span-level Machine Translation Error Detection on WMT MQM (EN-DE) 2022 (test)
[Chart: Precision, Recall, and F1 Score; top entry MQM #1 with Precision 42.66, as of Mar 20, 2026]
Updated 27d ago
Evaluation Results
| Method | Details | Date | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- | --- |
| MQM #1 | type=human evaluator | 2026.03 | 42.66 | 44.62 | 43.62 |
| MQM #2 | type=human evaluator | 2026.03 | 39.22 | 43.56 | 41.28 |
| Sonnet 4.5 | version=4.5 | 2026.03 | 32.61 | 29.75 | 31.12 |
| Haiku 4.5 | version=4.5 | 2026.03 | 30.07 | 19.89 | 23.94 |
| gpt-oss 120b | parameters=120b | 2026.03 | 24.08 | 31.04 | 27.12 |
| Qwen3 235b | parameters=235b | 2026.03 | 23.62 | 38.87 | 29.39 |
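
The page does not say how the F1 Score column is derived. A minimal sketch, assuming the standard definition (harmonic mean of precision and recall), shows that the reported values agree with that formula to within rounding. The function names `f1_score` and `max_f1_gap` are hypothetical helpers introduced only for this check. How spans are matched to produce precision and recall is not described on the page and is not modeled here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs.

    Assumption: the leaderboard uses the standard F1 definition.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results listed above.
ROWS = {
    "MQM #1": (42.66, 44.62, 43.62),
    "MQM #2": (39.22, 43.56, 41.28),
    "Sonnet 4.5": (32.61, 29.75, 31.12),
    "Haiku 4.5": (30.07, 19.89, 23.94),
    "gpt-oss 120b": (24.08, 31.04, 27.12),
    "Qwen3 235b": (23.62, 38.87, 29.39),
}


def max_f1_gap(rows: dict) -> float:
    """Largest absolute gap between recomputed and reported F1 over all rows."""
    return max(abs(f1_score(p, r) - reported) for p, r, reported in rows.values())


if __name__ == "__main__":
    for name, (p, r, reported) in ROWS.items():
        print(f"{name}: recomputed={f1_score(p, r):.2f} reported={reported:.2f}")
```

Small last-digit differences are expected, since the precision and recall shown on the page are themselves rounded to two decimals before being recombined here.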