| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| Decision Making OOD | TIMEOMNI-1 | ACC: 58.9 | 13 | 3d ago | |
| Decision Making (ID) | TIMEOMNI-1 | Accuracy: 47.9 | 13 | 3d ago | |
| TSR-Suite Task 4 | TIMEOMNI-VL | Accuracy: 61.4 | 8 | 3d ago | |
| WebShop | Dual-Process (AUQ) | Success Rate: 42.9 | 7 | 3d ago | |
| OR-ShARC (test) | EFT | Micro Aggregation Score: 0.785 | 7 | 3d ago | |
| OR-ShARC (dev) | EFT | Micro Avg: 83.4 | 7 | 3d ago | |
| AlfWorld | AutoRefine | Transition Success Rate: 98.4 | 7 | 3d ago | |
| Pandora's Box | | Optimal Match Rate: 1 | 5 | 3d ago | |
| StarCraft II (SC2) built-in AI LV7 (VeryHard) (test) | StarWM-Agent | Win Rate: 50 | 4 | 3d ago | |
| StarCraft II built-in AI LV6 (Harder) (test) | StarWM-Agent | Win Rate: 40 | 4 | 3d ago | |
| StarCraft II built-in AI LV5 (Hard) (out-of-distribution (OOD)) | StarWM-Agent | Win Rate: 50 | 4 | 3d ago | |
| (test) | | ACC1 Score: 14.2 | 3 | 3d ago | |
| Art UK-based participants | human-calibrated AI | Final Accuracy: 8.5 | 1 | 2d ago | |
| All Aggregated (UK-based participants) | human-calibrated AI | Final Accuracy: 5.2 | 1 | 2d ago | |
| Cities UK-based participants | human-calibrated AI | Accuracy: 1.7 | 1 | 2d ago | |
| Art 90% AI Accuracy (UK-based participants) | - | Final Accuracy: - | 0 | 3d ago | |