| Dataset Name | SOTA Method | Metric | Trend | |
|---|---|---|---|---|
| WebShop | W2SG with MCTS | Success Rate: 99 | 50 | 22d ago |
| AlfWorld | ResRL | Success Rate: 86.7 | 40 | 5d ago |
| PinchBench (PB) | LLM-guided spec search | Accuracy: 100 | 21 | 15d ago |
| AppWorld Normal (test) | ReAct + ACE | TGC: 76.2 | 20 | 1mo ago |
| Sudoku | | Success Rate (SR): 99 | 17 | 3mo ago |
| FrozenLake | | Success Rate: 100 | 17 | 3mo ago |
| BlocksWorld | | Success Rate: 100 | 17 | 3mo ago |
| Agent | Llama-3.2-1B-Instruct | Accuracy: 100 | 16 | 13d ago |
| ToolBench | PAIR | Success Rate: 44.98 | 16 | 15d ago |
| GTA | PAIR | Success Rate: 24.89 | 16 | 15d ago |
| AppWorld Challenge (test) | ReAct + ACE | Task Goal Completion (TGC): 66 | 13 | 2mo ago |
| ScienceWorld | E3-TIR | Success Rate: 67.5 | 11 | 1mo ago |
| AppWorld Average | ReAct + ACE | Average Score: 59.5 | 7 | 2mo ago |
| AlfWorld | Ceiling Model | Average Reward: 59 | 7 | 2mo ago |