| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Task Completion | ToolSandbox (test) | Avg Task Reward: 0.704 | 27 |
| Tool Use Evaluation | ToolSandbox | Similarity: 0.923 | 12 |
| Multi-turn agent decision making | ToolSandbox (test) | Success Rate: 52.2 | 7 |
| Agent Task Completion | ToolSandbox | Average Task Reward: 0.67 | 2 |