τ-Bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agentic Reasoning | τ-Bench | Score: 62.58 | 100 |
| Tool-use | τ-Bench | Average Pass@1: 67.42 | 38 |
| Long-context Reasoning | ∞ Bench | Accuracy: 90.39 | 32 |
| Agent Task Completion | τ-BENCH (test) | Average Task Reward: 0.791 | 27 |
| Tool-use Agent Performance | τ²-bench | Retail Success Rate: 82.5 | 19 |
| User Simulator Goal Alignment | τ-Bench Retail (test) | User Profile Success Rate: 94.5 | 14 |
| User simulator goal alignment | τ-Bench Retail | User Profile Adherence: 94.5 | 14 |
| User simulator goal alignment | τ-Bench Airline | User Profile Alignment (Prof.): 98.7 | 14 |
| Tool Use Reasoning | τ-Bench | Avg Accuracy: 63.9 | 14 |
| Long-context language tasks (MC, QA, Sum) | ∞Bench | MC Accuracy: 78.6 | 13 |
| Long-context Question Answering | ∞Bench | Accuracy: 78.46 | 13 |
| Question Answering | ∞-Bench Longbook QA English (test) | Tokens: 4,096 | 9 |
| Tool Use | τ²-Bench (out-of-distribution) | Retail Score: 54.9 | 8 |
| Task-oriented Dialogue | τ-bench 157 scenarios | Collaboration SR: 45.5 | 7 |
| Agentic Dialogue | τ-Bench (test) | Retail Accuracy: 60.4 | 7 |
| Web Agent Task Success | τ-bench | Retail Success Rate: 76.52 | 6 |
| Agentic Tool Use | τ²-Bench Telecom | Accuracy: 99.3 | 5 |
| Agentic Tool Use | τ²-Bench Airline | Accuracy: 67.5 | 5 |
| Agentic Tool Use | τ²-Bench Retail | Accuracy: 84.7 | 5 |
| Agent & OpenClaw | τ²-Bench | Accuracy: 76.6 | 5 |
| Tool-use Agent Robustness | τ-bench | Behavioral Uncertainty (BU): 6.9 | 5 |
| Tool-use | τ-bench | τ-bench Score: 38.3 | 4 |
| Tool-calling | τ-Bench (test) | TSR: 43.77 | 4 |
| Tool-use Agent Evaluation | τ-bench retail domain (All 115 tasks) | Pass@1: 43.9 | 4 |
| Tool-use Agent Evaluation | τ-bench retail domain (Last 105 tasks) | Pass@1: 42.5 | 4 |
Showing 25 of 32 rows