Tool-use instruction following on ToolBench v1 (Evaluation)
[Chart: Success Rate over time. Highlighted point: AgentMark, 71.7 Success Rate, Jan 5, 2026. Selectable metrics: Success Rate, Total Steps, BPS Rate, BPT Rate, Delta Steps per Step, Delta Tokens per Step (%).]
Updated 1mo ago
Evaluation Results
| Method | Setting | Date | Success Rate | Total Steps | BPS Rate | BPT Rate | Delta Steps per Step | Delta Tokens per Step (%) |
|---|---|---|---|---|---|---|---|---|
| AgentMark | Task=T4: Multi-tool wi... | 2026.01 | 71.7 | 6.5 | 0.49 | 5.1 | -1.54 | 10.97 |
| AgentMark | Task=T2: Single-tool i... | 2026.01 | 61.7 | 8.6 | 0.48 | 4.89 | -1.41 | -16.96 |
| AgentMark | Task=T1: Single-tool s... | 2026.01 | 60 | 5.6 | 0.51 | 5.28 | -1.67 | 9.64 |
| AgentMark | Task=T3: Single-tool w... | 2026.01 | 60 | 4.8 | 0.46 | 4.62 | -2.4 | -3.29 |
| AgentMark | Task=Avg. | 2026.01 | 59.7 | 7.2 | 0.49 | 4.93 | -1.27 | -6.25 |
| AgentMark | Task=T6: Complex multi... | 2026.01 | 55 | 9.6 | 0.49 | 4.9 | 0.24 | -14.77 |
| AgentMark | Task=T5: Multi-tool in... | 2026.01 | 50 | 7.8 | 0.48 | 4.85 | -0.83 | 5.24 |