Tool Use on Tool-use Domain (Aggregate)
Figure: AvgZ Score over time (highlighted point: Agent-Dice, 0.79, Jan 7, 2026).
Evaluation Results
| Method | Configuration | Date | AvgZ Score |
|---|---|---|---|
| Agent-Dice | Base Model=Qwen3-8B | 2026.01 | 0.79 |
| Agent-Dice | Base Model=Llama-3.1-8B | 2026.01 | 0.51 |
| CL from Subset 0, 1 & 2 | Base Model=Qwen3-8B, L... | 2026.01 | 0.48 |
| Learn from Subset 0 | Base Model=Llama-3.1-8B | 2026.01 | 0.45 |
| CL from Subset 0 & 1 | Base Model=Qwen3-8B, L... | 2026.01 | 0.44 |
| CL from Subset 0 & 1 | Base Model=Llama-3.1-8B | 2026.01 | 0.40 |
| Learn from Subset 3 | Base Model=Llama-3.1-8B | 2026.01 | 0.39 |
| CL from all Subsets | Base Model=Llama-3.1-8B | 2026.01 | 0.38 |
| Learn from Subset 2 | Base Model=Llama-3.1-8B | 2026.01 | 0.29 |
| Learn from Subset 3 | Base Model=Qwen3-8B, L... | 2026.01 | 0.28 |
| Learn from Subset 0 | Base Model=Qwen3-8B, L... | 2026.01 | 0.27 |
| CL from Subset 0, 1 & 2 | Base Model=Llama-3.1-8B | 2026.01 | 0.26 |
| Learn from Subset 1 | Base Model=Qwen3-8B, L... | 2026.01 | 0.13 |
| Learn from Subset 1 | Base Model=Llama-3.1-8B | 2026.01 | 0.13 |
| CL from all Subsets | Base Model=Qwen3-8B, L... | 2026.01 | 0.06 |
| Learn from Subset 2 | Base Model=Qwen3-8B, L... | 2026.01 | -1.02 |
| Zero-Shot | Base Model=Qwen3-8B, L... | 2026.01 | -1.42 |
| Zero-Shot | Base Model=Llama-3.1-8B | 2026.01 | -2.81 |
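The page does not define how AvgZ Score is computed. One common way to build such an aggregate is to standardize each task's scores across all methods (a z-score per task) and then average each method's z-scores over the tasks it reports. The sketch below assumes exactly that; the helper name `avg_z_scores`, the use of the population standard deviation, and the toy data are all hypothetical and are not the leaderboard's confirmed formula.

```python
from statistics import mean, pstdev


def avg_z_scores(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Hypothetical AvgZ sketch: scores[method][task] -> raw score.

    Assumption: each task is z-normalized across the methods that report it
    (population std dev), then each method's z-scores are averaged.
    """
    tasks = sorted({task for per_task in scores.values() for task in per_task})
    z_by_method: dict[str, list[float]] = {method: [] for method in scores}
    for task in tasks:
        # Collect raw scores only from methods that report this task.
        vals = {m: s[task] for m, s in scores.items() if task in s}
        raw = list(vals.values())
        mu = mean(raw)
        sigma = pstdev(raw)
        for method, value in vals.items():
            # Guard against a zero spread (all methods tied on this task).
            z = 0.0 if sigma == 0 else (value - mu) / sigma
            z_by_method[method].append(z)
    # Average per method; skip methods with no reported tasks.
    return {m: mean(zs) for m, zs in z_by_method.items() if zs}


if __name__ == "__main__":
    # Made-up toy data with two hypothetical tasks; not taken from the table.
    toy = {
        "A": {"t1": 0.9, "t2": 0.6},
        "B": {"t1": 0.5, "t2": 0.6},
        "C": {"t1": 0.1, "t2": 0.0},
    }
    print({m: round(z, 3) for m, z in avg_z_scores(toy).items()})
```

Because every z-score is relative to the other methods on the same task, an AvgZ value defined this way can shift whenever entries are added to or removed from the leaderboard, even if a method's raw scores stay the same.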