| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| τ-Bench | IOA | Score 62.58 | 100 | 3mo ago | |
| WebShop | EAPO-8B | Success Rate 65.58 | 45 | 22d ago | |
| ALFWorld | EAPO-8B | Success Rate 76.02 | 45 | 22d ago | |
| ALFWorld (test) | teacher top-K local support matching | Success Rate 97.7 | 21 | 2mo ago | |
| LoCoMo | CAMEL | Org. Score 72.3 | 20 | 22d ago | |
| ScienceWorld | CAMEL | Original Score 82.2 | 20 | 22d ago | |
| ALFWorld | CAMEL | Success Rate (Org.) 91 | 20 | 22d ago | |
| AndroidWorld | EAPO-8B | Success Rate 82.05 | 20 | 22d ago | |
| GAIA (val) | InternAgent-1.5 | Average Score 86.06 | 17 | 3mo ago | |
| WebShop (test) | RETROAGENT | Success Rate 82.3 | 15 | 2mo ago | |
| ResearchQA (test) | DR-Rubric-14B (BS-2) | Score 73.9 | 14 | 1d ago | |
| MineSweeper (test) | RETROAGENT | Success Rate 48.2 | 12 | 2mo ago | |
| Sokoban (test) | RETROAGENT | Success Rate 38.3 | 12 | 2mo ago | |
| BALROG | TemplateRL | Accuracy 31.5 | 8 | 16d ago | |
| FRAMES n=50 (full) | | Accuracy 77.31 | 8 | 3mo ago | |
| HLE | | Overall Score 41.6 | 7 | 3mo ago | |
| BFCL v4 (non-live and live) | CopT | Accuracy 86.45 | 6 | 14d ago | |
| ALFWorld | HölderPO | Pick Success Rate 97.2 | 5 | 21d ago | |
| GAIA Text | OctoTools | Accuracy 18.4 | 4 | 1mo ago | |
| VisualProbe Hard | DeepEyes-7B | Accuracy 75.9 | 3 | 26d ago | |
| VisualProbe Medium | DeepEyes-7B | Accuracy 84.7 | 3 | 26d ago | |
| TauBench V2 | Qwen3.5-122B-A10B | Airline Score 66 | 3 | 1mo ago | |
| Terminal Bench Core 2.0 | Qwen3.5-122B-A10B | Success Rate 37.5 | 3 | 1mo ago | |
| Terminal Bench hard | Qwen3.5-122B-A10B | Success Rate 26.8 | 3 | 1mo ago | |
| TIR-Bench | PyVision-Image | Accuracy 19.8 | 3 | 3mo ago | |