| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| WebShop | W2SG with MCTS | Success Rate | 99 | 30 | 1mo ago |
| AlfWorld | DeepSeek-R1 | Success Rate | 83.6 | 21 | 1mo ago |
| Sudoku | | Success Rate (SR) | 99 | 17 | 1mo ago |
| FrozenLake | | Success Rate | 100 | 17 | 1mo ago |
| BlocksWorld | | Success Rate | 100 | 17 | 1mo ago |
| AppWorld Challenge (test) | ReAct + ACE | Task Goal Completion (TGC) | 66 | 13 | 18d ago |
| ScienceWorld | E3-TIR | Success Rate | 67.5 | 11 | 5d ago |
| AppWorld Average | ReAct + ACE | Average Score | 59.5 | 7 | 18d ago |
| AppWorld Normal (test) | ReAct + ACE | TGC | 76.2 | 7 | 18d ago |
| AlfWorld | Ceiling Model | Average Reward | 59 | 7 | 1mo ago |