| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| ALFWorld | Deepseek-V4-Pro | Overall Success Rate | 99.6 | 295 | 21h ago |
| ScienceWorld Seen | GRASP | Success Rate | 88.8 | 72 | 15d ago |
| ALFWorld (test) | SELAUR | Success Rate | 96.87 | 71 | 22d ago |
| WebShop | SAMPO | Success Rate | 84.02 | 70 | 21h ago |
| ScienceWorld | Minimax-M2.7 | Success Rate | 54.2 | 42 | 20d ago |
| Textcraft | GPT-4o-1120 | Success Rate | 99.6 | 42 | 20d ago |
| InterCode NL2Bash | GRASP | Success Rate | 79.6 | 40 | 1mo ago |
| WebShop (Seen) | GRASP | Average Reward | 62.3 | 40 | 1mo ago |
| WebShop (test) | BPO | Success Rate | 97 | 37 | 15d ago |
| ALFWorld Unseen | STEP-HRL | Success Rate | 97.76 | 32 | 15d ago |
| ALFWorld Seen | STEP-HRL | Success Rate | 97.86 | 32 | 15d ago |
| ScienceWorld Unseen | BPO | Success Rate | 85.15 | 32 | 15d ago |
| WebShop | GPT-5 | Real | 39 | 24 | 1mo ago |
| TextWorld | GPT-5 | Real | 100 | 24 | 1mo ago |
| ScienceWorld Unseen (test) | ITPR | Success Rate | 58.94 | 24 | 3mo ago |
| ScienceWorld Seen (val) | AdaPlan-H | Average Reward | 0.7349 | 20 | 1mo ago |
| ALFWorld Seen (val) | AdaPlan-H | Average Reward | 0.8677 | 20 | 1mo ago |
| ALFWorld unseen (test) | AdaPlan-H | Average Reward | 88.81 | 20 | 1mo ago |
| Virtualhome | HISR | Success Rate | 59.1 | 15 | 2mo ago |
| ScienceWorld (OOD) | MTRouter | Score | 9.9 | 14 | 1mo ago |
| ScienceWorld (test) | MTRouter | Score | 53.8 | 14 | 1mo ago |
| ALFWorld (qualitative context) | | Success Rate (SR) | 99 | 8 | 2mo ago |