| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Operating System Control | AgentBench OS | Accuracy | 37.6 | 15 |
| Tool-calling for Clinical Question Answering | AgentBench FHIR (val) | Score | 76.7 | 8 |
| Success Rate | AgentBench | Success Rate | 34.1 | 8 |
| Human Correlation | AgentBench | Pearson r | 0.77 | 8 |
| Sequential task management and state maintenance | Lifelong AgentBench | Accuracy | 100 | 5 |
| Web Shopping | AgentBench Web Shopping | Task Completion Score (TCS) | 36.2 | 4 |
| Web Browsing/Site navigation | AgentBench WS | Task Completion Score (TCS) | 52 | 4 |
| Long-term Planning | AgentBench LTP | Task Completion Score (TCS) | 32.3 | 4 |
| Digital Card Game strategy | AgentBench DCG | Task Completion Score (TCS) | 73.6 | 4 |
| Knowledge Graph navigation | AgentBench KG | Task Completion Score (TCS) | 72.7 | 4 |
| Database interaction | AgentBench DB | Task Completion Score (TCS) | 74.3 | 4 |