Tau-bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Agent Performance | tau-bench | Retail Accuracy: 78.3 | 55 |
| Multi-turn tool-use interaction | tau-bench | Retail Success Rate: 86.1 | 35 |
| Stateful Agent-User Interaction | tau-Bench Airline | Pass@1: 39.1 | 22 |
| LLM Agent Evaluation | tau-bench Retail | Pass@1: 64.8 | 22 |
| Agentic Tool-use | τ2-Bench (Tau-bench) Retail and Telecom | Overall Success Rate: 85.79 | 17 |
| Tool | tau2-Bench | Accuracy: 30.7 | 14 |
| Tool-use | Tau-Bench | TAU-AIR Score: 67.5 | 14 |
| Agentic Tool-Use | Tau-Bench | Retail Score: 71.3 | 13 |
| Tool-use performance | Tau-bench Retail (test) | Pass Rate: 66 | 12 |
| Tool-use Task Completion | Tau-Bench Airline v2 (test) | Pass Rate: 70 | 11 |
| Tool Use | tau-bench retail domain | Accuracy: 57 | 10 |
| Tool Use | tau-bench airline domain | Tool Use Accuracy (Airline): 43.2 | 10 |
| Agent Execution | tau-Bench (test) | Execution Accuracy: 59 | 8 |
| LLM Agent Evaluation | tau-bench Airline | Accuracy: 42 | 7 |
| Multi-turn agent decision making | tau-Bench (test) | Success Rate: 55.8 | 7 |
| Tool Use | tau-Bench | Pass@1: 85.4 | 6 |
| Ranking Preservation | tau-bench Airline (test) | Mean Spearman Rho: 0.944 | 5 |
| Function-calling | Tau-bench retail | Success Rate: 46 | 5 |
| Function-calling | Tau-bench airline | Success Rate: 50 | 5 |
| Agentic Performance | TAU2-Bench | Success Rate: 85.4 | 5 |
| Malicious Action Detection | tau-Bench retail (test) | TPR: 7.8 | 4 |