| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| AlfWorld | | Steps | 6.4 | 22 | 2mo ago |
| WebShop | Dual-Process (AUQ) | Success Rate | 42.9 | 15 | 2mo ago |
| 30 synthetic decision-making rounds (evaluation) | Lark Full | Mean Rank | 2.55 | 14 | 1mo ago |
| Decision Making OOD | TIMEOMNI-1 | ACC | 58.9 | 13 | 3mo ago |
| Decision Making (ID) | TIMEOMNI-1 | Accuracy | 47.9 | 13 | 3mo ago |
| ALFWorld Avg v1 | BeliefMem | Success Rate (SR) | 59.88 | 9 | 26d ago |
| ALFWorld Unseen v1 | BeliefMem | Success Rate (SR) | 61.19 | 9 | 26d ago |
| ALFWorld Seen v1 | BeliefMem | Success Rate (SR) | 63.57 | 9 | 26d ago |
| Hiring Domain | DECISIVE | Top-1 Accuracy | 90.2 | 8 | 1mo ago |
| Finance Domain | DECISIVE | Top-1 Acc. | 79 | 8 | 1mo ago |
| Education Domain | DECISIVE | Top-1 Accuracy | 78.8 | 8 | 1mo ago |
| TSR-Suite Task 4 | TIMEOMNI-VL | Accuracy | 61.4 | 8 | 3mo ago |
| OAS (test) | | Timely Score | 0.5037 | 7 | 2mo ago |
| GlobalStore (test) | | Timeliness Score | 35.52 | 7 | 2mo ago |
| DataCo (test) | S2A | Timely Score | 0.5447 | 7 | 2mo ago |
| OR-ShARC (test) | EFT | Micro Aggregation Score | 0.785 | 7 | 3mo ago |
| OR-ShARC (dev) | EFT | Micro Avg | 83.4 | 7 | 3mo ago |
| vignette-based (inference) | MeTHanol (8B) | Vignette Score | 48.3 | 6 | 1mo ago |
| highway-env | | Average Speed | 32.19 | 5 | 21d ago |
| Deliberative decision-making tasks n=45 (overall) | DCI | Mean Tokens | 237,565 | 5 | 2mo ago |
| Pandora's Box | | Optimal Match Rate | 1 | 5 | 3mo ago |
| SinerGym (test) | Vintix II | Normalized Score | 92 | 4 | 1mo ago |
| MuJoCo (test) | Vintix II | Normalized Score | 1 | 4 | 1mo ago |
| MetaDrive (test) | Vintix | Normalized Score | 1.02 | 4 | 1mo ago |
| Meta-World (test) | Vintix II | Normalized Score | 69 | 4 | 1mo ago |