| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| τ-Bench | IOA | Score | 62.58 | 100 | 1mo ago |
| ALFWorld (test) | teacher top-K local support matching | Success Rate | 97.7 | 21 | 22d ago |
| GAIA (val) | InternAgent-1.5 | Average Score | 86.06 | 17 | 1mo ago |
| WebShop (test) | RETROAGENT | Success Rate | 82.3 | 15 | 1mo ago |
| MineSweeper (test) | RETROAGENT | Success Rate | 48.2 | 12 | 1mo ago |
| Sokoban (test) | RETROAGENT | Success Rate | 38.3 | 12 | 1mo ago |
| FRAMES n=50 (full) | | Accuracy | 77.31 | 8 | 1mo ago |
| HLE | | Overall Score | 41.6 | 7 | 1mo ago |
| GAIA Text | OctoTools | Accuracy | 18.4 | 4 | 4d ago |
| TauBench V2 | Qwen3.5-122B-A10B | Airline Score | 66 | 3 | 4d ago |
| Terminal Bench Core 2.0 | Qwen3.5-122B-A10B | Success Rate | 37.5 | 3 | 4d ago |
| Terminal Bench hard | Qwen3.5-122B-A10B | Success Rate | 26.8 | 3 | 4d ago |
| TIR-Bench | PyVision-Image | Accuracy | 19.8 | 3 | 1mo ago |
| τ-Bench (test) | - | Score | - | 0 | 1mo ago |