Tool Use on Tool use
Top result: SDPO (on-policy), 68.5 Avg@16 (Jan 28, 2026)
Evaluation Results
| Method | Configuration | Date | Avg@16 |
| --- | --- | --- | --- |
| SDPO (on-policy) | Backbone=Qwen3-8B, Tra... | 2026.01 | 68.5 |
| GRPO | Backbone=Qwen3-8B, Tra... | 2026.01 | 68.1 |
| GRPO (on-policy) | Backbone=Qwen3-8B, Tra... | 2026.01 | 68.1 |
| GRPO | Backbone=Olmo3-7B-Inst... | 2026.01 | 65 |
| SDPO (on-policy) | Backbone=Olmo3-7B-Inst... | 2026.01 | 62.5 |
| GRPO | Backbone=Qwen3-8B, Tra... | 2026.01 | 61.7 |
| GRPO (on-policy) | Backbone=Qwen3-8B, Tra... | 2026.01 | 61.7 |
| GRPO (on-policy) | Backbone=Olmo3-7B-Inst... | 2026.01 | 61.3 |
| Qwen3-8B | Training Time=0h | 2026.01 | 57.5 |
| SDPO (on-policy) | Backbone=Olmo3-7B-Inst... | 2026.01 | 57.3 |
| SDPO (on-policy) | Backbone=Qwen3-8B, Tra... | 2026.01 | 56.4 |
| GRPO | Backbone=Olmo3-7B-Inst... | 2026.01 | 56.4 |
| GRPO (on-policy) | Backbone=Olmo3-7B-Inst... | 2026.01 | 56 |
| Olmo3-7B-Instruct | Training Time=0h | 2026.01 | 39.3 |