| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agentic Reasoning | τ-Bench | Score 62.58 | 100 |
| Tool-use Agent Performance | τ²-bench | ASR 72.4 | 50 |
| Tool-use | τ-Bench | Average Pass@1 67.42 | 38 |
| Long-context Reasoning | ∞ Bench | Accuracy 90.39 | 32 |
| Agent Task Completion | τ-BENCH (test) | Average Task Reward 0.791 | 27 |
| User Simulator Goal Alignment | τ-Bench Retail (test) | User Profile Success Rate 94.5 | 19 |
| Conversational Tool-use | τ²-Bench | Airline Success Rate 75.5 | 18 |
| Behavioral Similarity Analysis | τ-Bench and τ²-Bench (test) | GED Score 82.6 | 18 |
| Agentic Tool Use | τ²-Bench Telecom | Accuracy 100 | 18 |
| Question Answering | ∞-Bench Longbook QA English (test) | F1 Score 11.2 | 18 |
| User Simulator Goal Alignment | τ-Bench Retail | User Profile Adherence 94.5 | 14 |
| User Simulator Goal Alignment | τ-Bench Airline | User Profile Alignment (Prof.) 98.7 | 14 |
| Tool Use Reasoning | τ-Bench | Avg Accuracy 63.9 | 14 |
| Tool Use | τ-Bench (TauB) V2 | Accuracy 91.6 | 13 |
| Long-context language tasks (MC, QA, Sum) | ∞Bench | MC Accuracy 78.6 | 13 |
| Long-context Question Answering | ∞Bench | Accuracy 78.46 | 13 |
| Story Question Answering | ∞Bench En.MC | Accuracy 90 | 12 |
| Customer Support Interaction | τ-Bench Telecom Verified (test) | Pass@1 94 | 11 |
| Customer Support Interaction | τ-Bench Retail Verified (test) | Pass Rate 92 | 11 |
| Customer Support Interaction | τ-Bench Airline Verified (test) | Pass@1 82 | 11 |
| Long-Text Understanding | ∞BENCH (test) | Overall Accuracy 85.6 | 10 |
| End-to-end task completion | τ-Bench Retail, N=5 | Task Completion Rate 0.111 | 8 |
| Tool Use | τ²-Bench (out-of-distribution) | Retail Score 54.9 | 8 |
| Question Answering | ∞Bench Zh.QA | F1 Score 49.1 | 7 |
| Question Answering | ∞Bench En.QA | F1 Score 42.1 | 7 |