Span-level machine translation error detection on MQM EN-ZH annotations 2024 (test)
[Chart: Precision, Recall, and F1 Score by method. Highest Precision: 35.95 (MQM #2). Data point dated Mar 20, 2026.]
Updated 27d ago
Evaluation Results
| Method | Details | Links | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| MQM #2 | type=human evaluator | 2026.03 | 35.95 | 25.68 | 29.96 |
| Sonnet 4.5 | version=4.5 | 2026.03 | 34.31 | 21.57 | 26.49 |
| MQM #1 | type=human evaluator | 2026.03 | 29.88 | 25.49 | 27.51 |
| Haiku 4.5 | version=4.5 | 2026.03 | 29.88 | 15.15 | 20.11 |
| gpt-oss 120b | parameters=120b | 2026.03 | 28.89 | 21.42 | 24.6 |
| Qwen3 235b | parameters=235b | 2026.03 | 28.88 | 30.92 | 29.87 |
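The reported F1 Score values match the harmonic mean of the Precision and Recall columns to two decimal places. The page does not say how F1 was computed, so the following is a minimal Python sketch under that assumption. The names `f1_score`, `ROWS`, and `max_f1_gap` are illustrative and not part of the benchmark.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same 0-100 scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, precision, recall, reported F1), copied from the results above.
ROWS = [
    ("MQM #2", 35.95, 25.68, 29.96),
    ("Sonnet 4.5", 34.31, 21.57, 26.49),
    ("MQM #1", 29.88, 25.49, 27.51),
    ("Haiku 4.5", 29.88, 15.15, 20.11),
    ("gpt-oss 120b", 28.89, 21.42, 24.6),
    ("Qwen3 235b", 28.88, 30.92, 29.87),
]


def max_f1_gap(rows) -> float:
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f1) for _, p, r, f1 in rows)


if __name__ == "__main__":
    for name, p, r, reported in ROWS:
        print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {reported}")
```

Under this assumption, a method with lower precision can still lead on F1 when its recall is high: Qwen3 235b has the lowest precision in the list but the highest recall.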