tau2-bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score: 81.1 | 84 |
| Task failure prediction and selective task completion | tau2-bench Telecom 1.0 | AUROC: 0.809 | 15 |
| Task failure prediction and selective task completion | tau2-bench Retail 1.0 | AUROC: 0.707 | 15 |
| Task failure prediction and selective task completion | tau2-bench Airline 1.0 | AUROC: 74.2 | 15 |
| Long-Horizon Tool Execution | tau2-Bench | Retail Success Rate: 75.2 | 12 |
| Multi-turn agent decision making | tau2-Bench (test) | Success Rate: 22.3 | 7 |
| Tool Use | Tau2-Bench | Success Rate: 57.4 | 6 |
| Agentic Tool-use | tau2-bench Airline | Pass@1: 63.5 | 6 |
| Agentic Tool-use | tau2-bench Retail | Pass@1: 82 | 6 |
| Agentic Skill Acquisition | tau2-bench | Pass@1: 76.7 | 5 |
| Agent Task Success | tau2-bench Retail Domain | Metric: - | 0 |
| Agent Task Success | tau2-bench Airline Domain | Metric: - | 0 |