Tau-bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agentic tool-use | Tau2-Bench | Retail Score: 90.4 | 59 |
| Agent Performance | tau-bench | Retail Accuracy: 78.3 | 55 |
| LLM Agent Evaluation | tau-bench Retail | Pass@1: 64.8 | 38 |
| Multi-turn tool-use interaction | tau-bench | Retail Success Rate: 86.1 | 35 |
| LLM Agent Evaluation | tau-bench Airline | Pass@4: 78 | 29 |
| Stateful Agent-User Interaction | tau-Bench Airline | Pass@1: 39.1 | 22 |
| Agentic Performance | TAU2-Bench | Success Rate: 85.4 | 20 |
| Agentic Tool-use | τ2-Bench (Tau-bench) Retail and Telecom | Overall Success Rate: 85.79 | 17 |
| Tool | tau2-Bench | Accuracy: 30.7 | 14 |
| Tool-use | Tau-Bench | TAU-AIR Score: 67.5 | 14 |
| Agentic Tool-Use | Tau-Bench | Retail Score: 71.3 | 13 |
| Tool-use performance | Tau-bench Retail (test) | Pass Rate: 66 | 12 |
| Tool-use Task Completion | Tau-Bench Airline v2 (test) | Pass Rate: 70 | 11 |
| Agent Performance | Tau-bench Telecom | Avg@4 Score: 45.39 | 10 |
| Agent Performance | Tau-bench Retail | Avg@4: 60.31 | 10 |
| Original Tasks | extended tau-Bench Retail domain | Pass@1: 90 | 10 |
| Sister Tasks | extended tau-Bench Retail domain | Pass@1 Rate: 83 | 10 |
| Original Tasks | tau-Bench Airline domain extended | Pass Rate @ Threshold 1: 70 | 10 |
| Sister Tasks | extended tau-Bench Airline domain | Pass@1 Rate: 74 | 10 |
| Tool Use | tau-bench retail domain | Accuracy: 57 | 10 |
| Tool Use | tau-bench airline domain | Tool Use Accuracy (Airline): 43.2 | 10 |
| Agent Execution | tau-Bench (test) | Execution Accuracy: 59 | 8 |
| Agentic Function | Tau2-Bench | Score: 81.2 | 7 |
| Multi-turn agent decision making | tau-Bench (test) | Success Rate: 55.8 | 7 |
| Tool Use | tau-Bench | Pass@1: 85.4 | 6 |
Showing 25 of 33 rows