| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| τ2-Bench Airline 1.0 (test) | NOD | CAP 96.4 | 48 | 21d ago | |
| τ2-Bench Retail 1.0 (test) | NOD | Completion Accuracy (CAP) 91.4 | 48 | 21d ago | |
| τ-Telecom | FAMA | Pass@1 Success Rate 52 | 16 | 1mo ago | |
| τ-Telehealth | FAMA | Pass^1 Rate 45 | 16 | 1mo ago | |
| Agent Capabilities | | Success Rate 90.4 | 15 | 2mo ago | |
| AssistantBench (test) | Learning to Share | Easy Accuracy 65.8 | 6 | 3mo ago | |
| Agent Task Benchmark 240 documents 1.0 (Evaluation set) | OBJECTGRAPH(E) | Information Lookup Success Rate 92.3 | 4 | 1mo ago | |