SOTA Agentic Tool-use benchmarks and papers with code | Wizwand
Agentic Tool-use
Benchmarks
| Dataset Name | SOTA Method | Metric | Metric Value | Results | Last Updated |
| --- | --- | --- | --- | --- | --- |
| Tau2-Bench | Seed 2.0 | Retail Score | 90.4 | 59 | 23h ago |
| tau2-bench Airline | CODEDELEGATOR | Pass@1 | 63.5 | 30 | 9d ago |
| tau2-bench Retail | CODEDELEGATOR | Pass@1 | 82 | 30 | 9d ago |
| Agentic Macro-aggregate | Uno-Orchestra | Pass@1 | 70.3 | 22 | 27d ago |
| AppWorld (Challenge) | RCL (all primitives) | TGC | 83.7 | 20 | 1mo ago |
| AppWorld Normal | RCL (all primitives) | Task Goal Completion (TGC) | 89.3 | 20 | 1mo ago |
| τ²-Bench Telecom | GPT 5.4 | Accuracy | 100 | 18 | 15d ago |
| τ2-Bench (Tau-bench) Retail and Telecom | Claude-Opus-4.5 | Overall Success Rate | 85.79 | 17 | 3mo ago |
| ACEBench | GEAR | ACE-E Score | 37.5 | 16 | 2d ago |
| BFCL | GEAR | BFCL Average Score | 94.1 | 16 | 21d ago |
| ToolSandbox Multi-Tool | GEAR | TS-M Score | 53.7 | 16 | 21d ago |
| Tau-Bench | Qwen3-235B-Instruct-2507 | Retail Score | 71.3 | 13 | 3mo ago |
| ACEBench (agent-task) | Gemini-2.5-Pro | Multi Turn Success Rate | 97.5 | 13 | 3mo ago |
| tau^2 Bench official evaluation setting GPT-4.1 simulator | REACT(GPT-5) | Retail Score | 0.775 | 9 | 3mo ago |
| Tau2-Telecom | LongCat-Flash-Lite | Avg@8 | 72.8 | 8 | 2mo ago |
| Tau2 Retail | LongCat-Next | Avg@8 | 73.68 | 8 | 2mo ago |
| Tau2-Airline | LongCat-Flash-Lite | Avg@8 | 58 | 8 | 2mo ago |
| τ²-Bench Airline | Qwen3.5-27B | Accuracy | 67.5 | 5 | 1mo ago |
| τ²-Bench Retail | Qwen3.5-27B | Accuracy | 84.7 | 5 | 1mo ago |
| SpreadSheetBench | Gemini 3-Pro | Success Rate | 55.36 | 5 | 3mo ago |
| Agentic/tool BFCL and Tool-2 | DASD | Overall Score | 0.711 | 4 | 12d ago |
| tau2-Bench | PivotRL | Accuracy | 64 | 2 | 2mo ago |
Showing 22 of 22 rows