Translator-call mode selection on PolyMath Medium
[Chart: Macro F1 over time. Top entry: LUAR, 75.22 Macro F1, Jun 1, 2026.]
Evaluation Results

| Method | Backbone | Date | Macro F1 |
|---|---|---|---|
| LUAR | Qwen3-4B | 2026.06 | 75.22 |
| ST(qr) | Qwen3-8B | 2026.06 | 69.19 |
| BOUNDARY-SFT | Qwen3-8B | 2026.06 | 65.9 |
| LUAR | Qwen3-8B | 2026.06 | 65.61 |
| ST(qr) | Qwen3-4B | 2026.06 | 64.59 |
| BOUNDARY-SFT | Qwen3-4B | 2026.06 | 62.22 |
| SELF-ASSESSMENT | Qwen3-4B | 2026.06 | 56.44 |
| ST(q) | Qwen3-8B | 2026.06 | 56.18 |
| SELF-ASSESSMENT | Qwen3-8B | 2026.06 | 54.28 |
| ST(q) | Qwen3-4B | 2026.06 | 52.59 |
| NATIVE-TOOL-USE | Qwen3-4B | 2026.06 | 49.61 |
| NATIVE-TOOL-USE | Qwen3-8B | 2026.06 | 39.23 |