
τ2-bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Agentic Task Performance | τ2-Bench Airline 1.0 (test) | CAP: 96.4 | 48 |
| Agentic Task Performance | τ2-Bench Retail 1.0 (test) | Completion Accuracy (CAP): 91.4 | 48 |
| Agentic Workflow Success | τ2-bench | Airline Success Rate: 76.5 | 43 |
| Agent | τ2-Bench | Accuracy: 85.4 | 41 |
| Agent Task Completion | τ2-BENCH (test) | Average Task Reward: 0.921 | 27 |
| Long-Horizon User-Centric Interaction | τ2-Bench | Telecom Success Rate: 46.9 | 23 |
| Multi-turn tool calling | τ2-bench | Airline Score: 38 | 19 |
| Agentic Task Completion | τ2-Bench | Airline Success Rate: 84 | 11 |
| Agentic task | τ2-Bench Telecom | Avg@2 Score: 45 | 8 |
| Agentic task | τ2-Bench Airline | Avg@4: 60 | 8 |
| Agentic task | τ2-Bench Retail | Avg@4: 69.7 | 8 |
| Agentic | τ2-Bench | Score: 91.6 | 7 |
| Web-based Decision-making | τ2 Bench Retail, Telecom, Airline | Retail Score: 48.3 | 5 |
| Agentic Task Completion | τ2-bench Retail | Success Rate: 100 | 4 |
| Agentic Task Completion | τ2-bench Airline | Success Rate: 97 | 4 |
| Memory-Poisoning Attack | τ2-Bench | Attack Hit Rate (AHR): 90.5 | 3 |
| Agentic Task Completion | τ2-bench Telecom | Success Rate: 100 | 3 |
| Task Resolution | τ2-bench (test) | Success Rate (Airline): 38.3 | 2 |
Showing 18 of 18 rows