| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agentic task solving | AppWorld | TGC | 90 | 28 |
| Multi-turn tool-use | AppWorld | Avg@4 | 63.6 | 25 |
| Agentic Tool-use | AppWorld (Challenge) | TGC | 83.7 | 20 |
| Agentic Tool-use | AppWorld Normal | Task Goal Completion (TGC) | 89.3 | 20 |
| Tool-use agentic performance | AppWorld | Avg@4 | 64.88 | 19 |
| Task and Scenario Goal Completion | AppWorld normal (test) | Task Goal Completion | 91.2 | 18 |
| Task goal completion | AppWorld (test challenge) | Goal Completion Score | 32 | 16 |
| Interactive environment task execution | AppWorld normal (test) | Avg@8 Success | 4,554 | 15 |
| Agent Task | AppWorld Challenge (test) | Task Goal Completion (TGC) | 66 | 13 |
| Agentic Task Completion | AppWorld LeaderBoard | Greedy Success Rate | 48.8 | 13 |
| Tool Shortlisting | AppWorld v1.0 (test) | R-precision (AZ) | 0.71 | 9 |
| Agent Task | AppWorld Average | Average Score | 59.5 | 7 |
| Agent Task | AppWorld Normal (test) | TGC | 76.2 | 7 |
| Multimodal app-use reasoning | AppWorld | Cost | 0.05 | 7 |
| Agent-based interactive task execution | AppWorld | Accuracy | 64.9 | 5 |
| Agentic Task Solving | AppWorld (test-n) | TGC Average | 81.15 | 4 |
| Task-goal completion | AppWorld Challenge Qwen-2.5-32B (test) | Average Task Completion Score | 51 | 4 |
| Task-goal completion | AppWorld Normal Qwen-2.5-32B (test) | Average Task Completion Score | 75 | 4 |
| Web Task Execution | AppWorld Normal (test) | Task Goal Success Rate | 89.5 | 4 |
| App-based Agentic Task | AppWorld unseen tasks (test) | Pass@1 | 66.6 | 3 |
| Agent Task and Scenario Completion | AppWorld (dev) | Task Goal Completion | 89.5 | 2 |
| Agent Task and Scenario Completion | AppWorld (train) | Task Goal Completion | 91.1 | 2 |