Span-level Machine Translation Error Detection on WMT MQM (ZH-EN) 2023 (test)
[Chart: Precision over time. Highlighted result: 50.25 (Haiku 4.5), Mar 20, 2026. Selectable metrics: Precision, Recall, F1 Score.]
Updated 27d ago
Evaluation Results
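| Method       | Details              | Date    | Precision | Recall | F1 Score |
|--------------|----------------------|---------|-----------|--------|----------|
| Haiku 4.5    | version=4.5          | 2026.03 | 50.25     | 25.66  | 33.97    |
| Sonnet 4.5   | version=4.5          | 2026.03 | 48.82     | 33.94  | 40.04    |
| MQM #2       | type=human evaluator | 2026.03 | 44.57     | 39.82  | 42.06    |
| gpt-oss 120b | parameters=120b      | 2026.03 | 44.55     | 29.1   | 35.2     |
| MQM #1       | type=human evaluator | 2026.03 | 40.17     | 39.48  | 39.82    |
| Qwen3 235b   | parameters=235b      | 2026.03 | 40.12     | 39.5   | 39.81    |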
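
The page does not say how the F1 Score column is derived. The reported values are consistent with the standard harmonic mean of precision and recall, F1 = 2PR / (P + R), and the minimal sketch below checks that assumption against the results above. The names `ROWS`, `f1_score`, and `check_rows` are hypothetical helpers introduced for illustration; they are not part of the benchmark.

```python
# Rows copied from the results above: (method, precision, recall, reported F1).
ROWS = [
    ("Haiku 4.5", 50.25, 25.66, 33.97),
    ("Sonnet 4.5", 48.82, 33.94, 40.04),
    ("MQM #2", 44.57, 39.82, 42.06),
    ("gpt-oss 120b", 44.55, 29.1, 35.2),
    ("MQM #1", 40.17, 39.48, 39.82),
    ("Qwen3 235b", 40.12, 39.5, 39.81),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero.

    Assumption: the leaderboard uses this standard definition.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def check_rows(rows, tolerance=0.015):
    """Return (method, recomputed F1, matches reported value) for each row."""
    results = []
    for method, precision, recall, reported in rows:
        recomputed = round(f1_score(precision, recall), 2)
        results.append((method, recomputed, abs(recomputed - reported) <= tolerance))
    return results


if __name__ == "__main__":
    for method, recomputed, ok in check_rows(ROWS):
        print(f"{method:14s} F1={recomputed:.2f} matches_reported={ok}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs. That is why Haiku 4.5 has the highest precision in the table but the lowest F1: its recall is the lowest listed.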