
τ-Bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Agentic Reasoning | τ-Bench | Score: 62.58 | 100 |
| Long-context Reasoning | ∞ Bench | Accuracy: 90.39 | 32 |
| Agent Task Completion | τ-BENCH (test) | Average Task Reward: 0.791 | 27 |
| Tool Use Reasoning | τ-Bench | Avg Accuracy: 63.9 | 14 |
| Long-context language tasks (MC, QA, Sum) | ∞Bench | MC Accuracy: 78.6 | 13 |
| Long-context Question Answering | ∞Bench | Accuracy: 78.46 | 13 |
| Tool-use Agent Performance | τ²-bench | Pass@1: 56.4 | 12 |
| Tool Use | τ²-Bench (out-of-distribution) | Retail Score: 54.9 | 8 |
| Agentic Dialogue | τ-Bench (test) | Retail Accuracy: 60.4 | 7 |
| Agent Task Completion | τ²-Bench | Avg Task Reward: 92.1 | 2 |
| Text-to-All Generation | Bench | CLIP-FID (FG): 25.5 | 2 |
| Background Generation | Bench | CLIP-FID (Compositional): 21 | 2 |
| Foreground Generation | Bench | CLIP-FID (Comp.): 13.4 | 2 |
| Failure attribution | τ-bench | Agent Accuracy: 75.9 | 2 |
| Agentic Reasoning | τ-Bench (test) | Score: - | 0 |