Translator-call Mode Selection on PolyMath Low
[Chart: Macro F1 Score over time. Top result: LUAR, 87.64 (Jun 1, 2026).]
Updated 1d ago
Evaluation Results

| Method | Backbone | Date | Macro F1 Score |
|---|---|---|---|
| LUAR | Qwen3-4B | 2026.06 | 87.64 |
| ST(qr) | Qwen3-8B | 2026.06 | 82.92 |
| ST(qr) | Qwen3-4B | 2026.06 | 79.26 |
| LUAR | Qwen3-8B | 2026.06 | 78.32 |
| BOUNDARY-SFT | Qwen3-4B | 2026.06 | 73.14 |
| BOUNDARY-SFT | Qwen3-8B | 2026.06 | 69.2 |
| ST(q) | Qwen3-8B | 2026.06 | 64.01 |
| ST(q) | Qwen3-4B | 2026.06 | 63.25 |
| NATIVE-TOOL-USE | Qwen3-8B | 2026.06 | 58.41 |
| NATIVE-TOOL-USE | Qwen3-4B | 2026.06 | 57.6 |
| SELF-ASSESSMENT | Qwen3-8B | 2026.06 | 43.48 |
| SELF-ASSESSMENT | Qwen3-4B | 2026.06 | 41.1 |
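
For readers unfamiliar with the metric, the following is a minimal sketch of how a macro F1 score is commonly computed: per-class F1 is calculated independently and then averaged with equal weight per class. The page does not state the label set or the scaling used for this benchmark, so the labels `"call"` and `"skip"` and the 0-100 scaling below are assumptions for illustration only, not the benchmark's actual evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, scaled to 0-100 (scaling is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        per_class.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(per_class) / len(per_class)


# Hypothetical mode labels: whether to call the translator or not.
gold = ["call", "call", "skip", "skip"]
pred = ["call", "skip", "skip", "skip"]
score = macro_f1(gold, pred)
print(f"{score:.2f}")
```

Because each class contributes equally regardless of how often it occurs, macro F1 penalizes a method that does well on the majority mode but poorly on the rare one.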