tau2-bench

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score | 81.1 | 102 |
| Agentic Tool-use | tau2-bench Airline | Pass@1 | 63.5 | 30 |
| Agentic Tool-use | tau2-bench Retail | Pass@1 | 82 | 30 |
| Task failure prediction and selective task completion | tau2-bench Telecom 1.0 | AUROC | 0.809 | 15 |
| Task failure prediction and selective task completion | tau2-bench Retail 1.0 | AUROC | 0.707 | 15 |
| Task failure prediction and selective task completion | tau2-bench Airline 1.0 | AUROC | 74.2 | 15 |
| Stateful Interaction | tau2-bench | Score | 86.69 | 12 |
| Long-Horizon Tool Execution | tau2-Bench | Retail Success Rate | 75.2 | 12 |
| Agent Task Completion | tau2-bench telecom | Pass Rate | 67 | 9 |
| Agent Task Completion | tau2-bench airline | Pass Rate | 60 | 9 |
| Agent Task Success | tau2-bench Retail Domain | Total Pass Rate | 61.4 | 9 |
| Agentic Skill Acquisition | tau2-bench | Pass@1 | 81.2 | 9 |
| Multi-turn agent decision making | tau2-Bench (test) | Success Rate | 22.3 | 7 |
| General Task (Agentic Coding) | tau2-Bench Telecom | Score | 98.2 | 6 |
| Tool Use | Tau2-Bench | Success Rate | 57.4 | 6 |
| User Simulation Behavioral Alignment | tau2-bench Retail + Airline (test) | HL Score | 95.8 | 5 |
| User Simulation Behavioral Alignment | tau2-bench Airline (test) | HL | 90.3 | 5 |
| Stateful service dialogue | tau2-bench Telecom | Task Completion Score (TCS) | 85.1 | 4 |
| Tool-use agent scalability and performance | Tau2-Bench | Runtime | 1 | 2 |
| Agentic Tool Use | tau2-Bench | Accuracy | 64 | 2 |
| Agent Task Success | tau2-bench Airline Domain | Metric | - | 0 |