Tool Use Reasoning on Tool use
[Chart: Avg Accuracy @16 (1h) and Avg Accuracy @16 (5h) over time. Labeled point: SDPO (on-policy), 68 Avg Accuracy @16 (1h), Jan 28, 2026.]
Updated 4d ago
Evaluation Results
Method             Configuration              Date     Avg Accuracy @16 (1h)  Avg Accuracy @16 (5h)
SDPO (on-policy)   Backbone=Qwen3-8B, Alg...  2026.01  68                     68.5
GRPO               Backbone=Qwen3-8B, Alg...  2026.01  64.9                   67.7
SDPO (on-policy)   Backbone=Olmo3-7B-Inst...  2026.01  60.8                   62.1
GRPO (on-policy)   Backbone=Qwen3-8B, Alg...  2026.01  60.2                   65.7
Qwen3-8B           Backbone=Qwen3-8B, Alg...  2026.01  57.5                   -
GRPO (on-policy)   Backbone=Olmo3-7B-Inst...  2026.01  56.8                   60.6
GRPO               Backbone=Olmo3-7B-Inst...  2026.01  56.4                   65
Olmo3-7B-Instruct  Backbone=Olmo3-7B-Inst...  2026.01  39.3                   -