τ-Bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agentic Reasoning | τ-Bench | Score: 62.58 | 100 |
| Tool-use Agent Performance | τ²-bench | ASR: 72.4 | 50 |
| Tool-use | τ-Bench | Average Pass@1: 67.42 | 38 |
| Long-context Reasoning | ∞ Bench | Accuracy: 90.39 | 32 |
| Agent Task Completion | τ-BENCH (test) | Average Task Reward: 0.791 | 27 |
| User Simulator Goal Alignment | τ-Bench Retail (test) | User Profile Success Rate: 94.5 | 19 |
| Conversational Tool-use | τ²-Bench | Airline Success Rate: 75.5 | 18 |
| Behavioral Similarity Analysis | τ-Bench and τ2-Bench (test) | GED Score: 82.6 | 18 |
| Agentic Tool Use | τ²-Bench Telecom | Accuracy: 100 | 18 |
| Question Answering | ∞-Bench Longbook QA English (test) | F1 Score: 11.2 | 18 |
| User simulator goal alignment | τ-Bench Retail | User Profile Adherence: 94.5 | 14 |
| User simulator goal alignment | τ-Bench Airline | User Profile Alignment (Prof.): 98.7 | 14 |
| Tool Use Reasoning | τ-Bench | Avg Accuracy: 63.9 | 14 |
| Tool Use | τ-Bench (TauB) V2 | Accuracy: 91.6 | 13 |
| Long-context language tasks (MC, QA, Sum) | ∞Bench | MC Accuracy: 78.6 | 13 |
| Long-context Question Answering | ∞Bench | Accuracy: 78.46 | 13 |
| Story Question Answering | ∞Bench En.MC | Accuracy: 90 | 12 |
| Customer Support Interaction | τ-Bench Telecom Verified (test) | Pass@1: 94 | 11 |
| Customer Support Interaction | τ-Bench Retail Verified (test) | Pass Rate: 92 | 11 |
| Customer Support Interaction | τ-Bench Airline Verified (test) | Pass@1: 82 | 11 |
| Long-Text Understanding | ∞BENCH (test) | Overall Accuracy: 85.6 | 10 |
| End-to-end task completion | τ-Bench Retail, N=5 | Task Completion Rate: 0.111 | 8 |
| Tool Use | τ²-Bench (out-of-distribution) | Retail Score: 54.9 | 8 |
| Question Answering | ∞Bench Zh.QA | F1 Score: 49.1 | 7 |
| Question Answering | ∞Bench En.QA | F1 Score: 42.1 | 7 |
Showing 25 of 51 rows