Translator-call mode selection on PolyMath High
Best Macro F1: 73.74 (LUAR), Jun 1, 2026. Updated 1d ago.
Evaluation Results

| Method | Backbone | Date | Macro F1 |
|---|---|---|---|
| LUAR | Qwen3-4B | 2026.06 | 73.74 |
| LUAR | Qwen3-8B | 2026.06 | 69.74 |
| BOUNDARY-SFT | Qwen3-4B | 2026.06 | 66.85 |
| ST(qr) | Qwen3-8B | 2026.06 | 60.73 |
| BOUNDARY-SFT | Qwen3-8B | 2026.06 | 60.54 |
| ST(q) | Qwen3-8B | 2026.06 | 60.41 |
| ST(qr) | Qwen3-4B | 2026.06 | 58.24 |
| ST(q) | Qwen3-4B | 2026.06 | 53.58 |
| SELF-ASSESSMENT | Qwen3-4B | 2026.06 | 52.93 |
| SELF-ASSESSMENT | Qwen3-8B | 2026.06 | 48.6 |
| NATIVE-TOOL-USE | Qwen3-4B | 2026.06 | 48.54 |
| NATIVE-TOOL-USE | Qwen3-8B | 2026.06 | 45.6 |