| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agentic task solving | AppWorld | TGC | 90 | 28 |
| Agentic Task | AppWorld C (test) | PR Score | 48.4 | 26 |
| Agentic Task | AppWorld N (test) | PR Score | 57.1 | 26 |
| Multi-turn tool-use | AppWorld | Avg@4 | 63.6 | 25 |
| Agentic Task Completion | AppWorld (test-normal) | Accuracy | 56.5 | 22 |
| Agentic Tool-use | AppWorld (Challenge) | TGC | 83.7 | 20 |
| Agentic Tool-use | AppWorld Normal | Task Goal Completion (TGC) | 89.3 | 20 |
| Agent Task | AppWorld Normal (test) | TGC | 76.2 | 20 |
| Tool-use agentic performance | AppWorld | Avg@4 | 64.88 | 19 |
| Task and Scenario Goal Completion | AppWorld normal (test) | Task Goal Completion | 91.2 | 18 |
| Task goal completion | AppWorld (test challenge) | Goal Completion Score | 32 | 16 |
| Interactive environment task execution | AppWorld normal (test) | Avg@8 Success | 4,554 | 15 |
| Agentic Task Completion | AppWorld Challenge (test) | Task Goal Completion (TGC) | 49.88 | 13 |
| Agent Task | AppWorld Challenge (test) | Task Goal Completion (TGC) | 66 | 13 |
| Agentic Task Completion | AppWorld LeaderBoard | Greedy Success Rate | 48.8 | 13 |
| Scenario-level policy synthesis | AppWorld normal (test) | Task Goal Completion (TGC) | 98.2 | 12 |
| Agentic Task Completion | AppWorld normal Hard (test) | Accuracy | 39.7 | 11 |
| Agentic Task Completion | AppWorld Easy normal (test) | Accuracy | 86 | 11 |
| Interactive coding-centric agent tasks | AppWorld | Success Rate | 22.6 | 10 |
| App-based Task Execution | AppWorld-Challenge | Task Goal Completion (TGC) | 52.8 | 10 |
| App-based Task Execution | AppWorld Normal | Task Goal Completion (TGC) | 71.4 | 10 |
| Tool Shortlisting | AppWorld v1.0 (test) | R-precision (AZ) | 0.71 | 9 |
| Task Goal Completion | AppWorld | Average Completion Score @4 | 18.46 | 7 |
| Agent task completion | AppWorld | TGC Success Rate (N) | 83.7 | 7 |
| Scenario-level policy synthesis | AppWorld challenge (test) | TGC | 98.3 | 7 |