
τ-Bench

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result	(unlabeled)
Agentic Reasoning	τ-Bench	Score	62.58	100
Tool-use Agent Performance	τ²-bench	ASR	72.4	50
Tool-use	τ-Bench	Average Pass@1	85.5	45
Long-context Reasoning	∞ Bench	Accuracy	90.39	32
Agent Task Completion	τ-bench-retail	Success Rate	70.2	31
Agent Task Completion	τ-BENCH (test)	Average Task Reward	0.791	27
User Simulator Goal Alignment	τ-Bench Retail (test)	User Profile Success Rate	94.5	19
Agentic Task Completion	τ-bench Average	Success Rate	40	18
Conversational Tool-use	τ²-Bench	Airline Success Rate	75.5	18
Behavioral Similarity Analysis	τ-Bench and τ2-Bench (test)	GED Score	82.6	18
Agentic Tool Use	τ²-Bench Telecom	Accuracy	100	18
Question Answering	∞-Bench Longbook QA English (test)	F1 Score	11.2	18
Agent Task Performance	τ2-Bench	Retail Performance	79.4	16
Conversational function-calling	τ-bench retail (public leaderboard)	Score	80.9	14
User simulator goal alignment	τ-Bench Retail	User Profile Adherence	94.5	14
User simulator goal alignment	τ-Bench Airline	User Profile Alignment (Prof.)	98.7	14
Tool Use Reasoning	τ-Bench	Avg Accuracy	63.9	14
Tool Use	τ-Bench (TauB) V2	Accuracy	91.6	13
Long-context language tasks (MC, QA, Sum)	∞Bench	MC Accuracy	78.6	13
Long-context Question Answering	∞Bench	Accuracy	78.46	13
Tool-use Agent Evaluation	τ²-BENCH airline full 50-task pool	Pass@4 Success Rate	78	12
Interactive tool-use	τ-bench retail domain	Action Reward	28.7	12
Tool-use agent evaluation	τ-bench airline	Pass@1	76	12
Story Question Answering	∞Bench En.MC	Accuracy	90	12
Customer Support Interaction	τ-Bench Telecom Verified (test)	Pass@1	94	11

Showing 25 of 63 rows