
tau2-bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score: 81.1 | 102 |
| Agentic Tool-use | tau2-bench Airline | Pass@1: 63.5 | 22 |
| Agentic Tool-use | tau2-bench Retail | Pass@1: 82 | 22 |
| Task failure prediction and selective task completion | tau2-bench Telecom 1.0 | AUROC: 0.809 | 15 |
| Task failure prediction and selective task completion | tau2-bench Retail 1.0 | AUROC: 0.707 | 15 |
| Task failure prediction and selective task completion | tau2-bench Airline 1.0 | AUROC: 74.2 | 15 |
| Long-Horizon Tool Execution | tau2-Bench | Retail Success Rate: 75.2 | 12 |
| Agentic Skill Acquisition | tau2-bench | Pass@1: 81.2 | 9 |
| Multi-turn agent decision making | tau2-Bench (test) | Success Rate: 22.3 | 7 |
| General Task (Agentic Coding) | tau2-Bench Telecom | Score: 98.2 | 6 |
| Tool Use | Tau2-Bench | Success Rate: 57.4 | 6 |
| Agentic Tool Use | tau2-Bench | Accuracy: 64 | 2 |
| Agent Task Success | tau2-bench Retail Domain | Metric: - | 0 |
| Agent Task Success | tau2-bench Airline Domain | Metric: - | 0 |