| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Web Agent | WorkArena L2 | Success Rate: 4.7 | 18 |
| Web Agent | WorkArena L1 | Success Rate: 38.8 | 18 |
| Web Navigation and Automation | WorkArena Held-out Tasks (test) | Success Rate: 70 | 16 |
| Web Navigation and Automation | WorkArena Held-out Goals (test) | Success Rate: 53.8 | 16 |
| Enterprise interface task completion | WorkArena L1 | Task Success Rate: 79.7 | 14 |
| Reward Modeling | WorkArena | Pairwise Accuracy: 84.33 | 13 |
| HTML observation reduction | WorkArena | Average Wall-Clock Time (seconds): 0.01 | 11 |
| Web Agent Navigation | WorkArena L2 147-task (test) | Success Rate: 40 | 10 |
| Web Agent Navigation | WorkArena L1 (full) | Success Rate: 79.4 | 10 |
| Enterprise interface task completion | WorkArena++ L2 | Success Rate: 41.6 | 9 |
| Web Task Automation | WorkArena L1 | Average Reward: 68 | 8 |
| Enterprise Workflow Automation | WorkArena (test) | M&D Score: 45.1 | 7 |
| Web Navigation | WorkArena L2 | Success Rate: 6.8 | 5 |
| Web Navigation | WorkArena L1 | Success Rate: 7.6 | 5 |
| Web agent interaction | WorkArena L1 | Cumulative Runtime (h/m): 0.8033 | 3 |
| Enterprise interface interaction | WorkArena L2 full benchmark | Success Rate: 69.4 | 3 |
| Enterprise interface interaction | WorkArena L2 (test) | Success Rate: 9.7 | 2 |