Agentic Tool-Use on ToolSandbox Multi-Tool
[Chart: TS-M Score over time. Top result: GEAR, 53.7 TS-M Score (May 12, 2026).]
Updated 21d ago
Evaluation Results
| Method | Configuration | Date | TS-M Score |
|---|---|---|---|
| GEAR | Backbone=Qwen3-8B, Eva... | 2026.05 | 53.7 |
| GEAR | Backbone=Qwen3-4B, Eva... | 2026.05 | 42.3 |
| GRPO | Backbone=Qwen3-8B, Eva... | 2026.05 | 38.3 |
| ARPO | Backbone=Qwen3-8B, Eva... | 2026.05 | 37.9 |
| OPSD+RL | Backbone=Qwen3-8B, Eva... | 2026.05 | 36.8 |
| ARPO | Backbone=Qwen3-4B, Eva... | 2026.05 | 36.7 |
| MT-GRPO | Backbone=Qwen3-8B, Eva... | 2026.05 | 36.5 |
| Base | Backbone=Qwen3-8B, Eva... | 2026.05 | 35.8 |
| GRPO | Backbone=Qwen3-4B, Eva... | 2026.05 | 34.6 |
| GEAR | Backbone=Qwen3-4B, Eva... | 2026.05 | 34.3 |
| OPSD+RL | Backbone=Qwen3-4B, Eva... | 2026.05 | 33.7 |
| MT-GRPO | Backbone=Qwen3-4B, Eva... | 2026.05 | 33.2 |
| GRPO | Backbone=Qwen3-4B, Eva... | 2026.05 | 31.5 |
| OPSD | Backbone=Qwen3-8B, Eva... | 2026.05 | 31.2 |
| Base | Backbone=Qwen3-4B, Eva... | 2026.05 | 30.4 |
| OPSD | Backbone=Qwen3-4B, Eva... | 2026.05 | 27.9 |