τ2-bench

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result
Agentic Task Performance	τ2-Bench Airline 1.0 (test)	CAP	96.4	48
Agentic Task Performance	τ2-Bench Retail 1.0 (test)	Completion Accuracy (CAP)	91.4	48
Agentic Workflow Success	τ2-bench	Airline Success Rate	76.5	43
Agent	τ2-Bench	Accuracy	85.4	41
Uncertainty quantification	τ2-bench Retail	AUROC	0.899	32
Uncertainty quantification	τ2-bench Airline	AUROC	86.5	32
Agent Task Completion	τ2-BENCH (test)	Average Task Reward	0.921	27
Long-Horizon User-Centric Interaction	τ2-Bench	Telecom Success Rate	46.9	23
Agentic Task Completion	τ2-bench Airline	Success Rate	97	22
Agentic	τ2-Bench	Score	91.6	20
Agentic Task Completion	τ2-Bench	Airline Success Rate	84	19
Multi-turn tool calling	τ2-bench	Airline Score	38	19
Agentic task	τ2-Bench Telecom	Avg@2 Score	45	8
Agentic task	τ2-Bench Airline	Avg@4	60	8
Agentic task	τ2-Bench Retail	Avg@4	69.7	8
Web-based Decision-making	τ2 Bench Retail, Telecom, Airline	Retail Score	48.3	5
Agentic Task Completion	τ2-bench Retail	Success Rate	100	4
Tool-use Policy Auditing	τ2-BENCH Telecom (test)	PASS^4	30	3
Tool-use Policy Auditing	τ2-BENCH Telecom base	PASS^4	20.2	3
Tool-use Policy Auditing	τ2-BENCH Retail (base)	PASS^4	59.6	3
Tool-use Agent Task	τ2-bench (full)	Retail Success Rate	82.46	3
Memory-Poisoning Attack	τ2-Bench	Attack Hit Rate (AHR)	90.5	3
Agentic Task Completion	τ2-bench Telecom	Success Rate	100	3
Task Resolution	τ2-bench (test)	Success Rate (Airline)	38.3	2
