| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agentic task solving | AppWorld | TGC | 90 | 28 |
| Multi-turn tool-use | AppWorld | Avg@4 | 63.6 | 25 |
| Agentic Task Completion | AppWorld LeaderBoard | Greedy Success Rate | 48.8 | 13 |
| Tool Shortlisting | AppWorld v1.0 (test) | R-precision (AZ) | 0.71 | 9 |
| Interactive environment task execution | AppWorld normal (test) | Avg@8 Success | 4,554 | 9 |
| Multimodal app-use reasoning | AppWorld | Cost | 0.05 | 7 |