| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agent Performance | ACEBench Agent | Agent Score | 78 | 36 |
| Multi-turn agent task | ACEBench multi-turn (test) | Process Accuracy | 76.5 | 15 |
| Agentic Performance | ACEBench Agent | End-to-End Accuracy | 60 | 15 |
| Cross-Lingual Planning | ACEBench | Score (En) | 78.3 | 14 |
| Agent Capability Evaluation | ACEBench Agent | Multi-Step Reasoning Score | 95 | 13 |
| Agentic Tool-use | ACEBench (agent-task) | Multi Turn Success Rate | 97.5 | 13 |
| Function Calling | ACEBench Normal | Accuracy | 75.6 | 13 |
| Function Calling | ACEBench Normal (test) | Summary Score | 53 | 11 |
| Tool Use | ACEBench-en (out-of-distribution) | Normal Score | 77.9 | 8 |
| Multi-turn Dialogue | ACEBench En | MT Accuracy | 68 | 7 |
| Agentic Performance | ACEBench-en | End-to-End Accuracy | 56 | 7 |
| Agentic Performance | ACEBench-zh | Accuracy | 89.6 | 5 |