
τ2-bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Agentic Workflow Success | τ2-bench | Airline Success Rate: 76.5 | 43 |
| Agent Task Completion | τ2-BENCH (test) | Average Task Reward: 0.921 | 27 |
| Long-Horizon User-Centric Interaction | τ2-Bench | Telecom Success Rate: 46.9 | 23 |
| Agent | τ2-Bench | Accuracy: 85.4 | 9 |
| Agentic Task Completion | τ2-Bench | Airline Success Rate: 44 | 7 |
| Agentic | τ2-Bench | Score: 91.6 | 7 |
| Multi-turn tool calling | τ2-bench | Overall Score: 17.77 | 5 |
| Web-based Decision-making | τ2 Bench Retail, Telecom, Airline | Retail Score: 48.3 | 5 |
| Task Resolution | τ2-bench (test) | Success Rate (Airline): 38.3 | 2 |