| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| WebVoyager | GPT-5 (SoM) | Success Rate | 90.6 | 68 | 1d ago |
| WebArena | STRUCTUREDAGENT | Overall Success Rate | 52.6 | 55 | 1mo ago |
| Mind2Web | Triton-GRPO-32B | Overall Success Rate | 58.7 | 41 | 22d ago |
| Mind2Web Cross-Domain | FluxMem | Element Accuracy (EA) | 65.2 | 37 | 6d ago |
| WebShop | ICRL | Success Rate | 76 | 32 | 2d ago |
| WebShop Source | MemoryBank | Success Rate | 100 | 27 | 3mo ago |
| WebArena Lite | Gemma3 + MiRA | Gitlab SR | 56.7 | 24 | 15d ago |
| MM-Mind2Web | | Step Success Rate (SR) | 22.97 | 22 | 14d ago |
| WebArena Lite v2 | GPT-4o + ScaleCUA-7B | Average Success Rate | 28.6 | 19 | 7d ago |
| WebShop Drift II | Vanilla + GLOVE | Success Rate | 95 | 18 | 3mo ago |
| WebShop Drift I | Generative Agent + GLOVE | Success Rate | 95 | 18 | 3mo ago |
| WebShop Drift II - Semantic Shift | Voyager + GLOVE | Success Rate | 95 | 18 | 3mo ago |
| WebShop Drift I - Semantic Shift | Vanilla + GLOVE | Success Rate | 95 | 18 | 3mo ago |
| WebShop (test) | MAGE | Score | 90.2 | 16 | 21d ago |
| Mind2Web Live (test) | | Task Completion Rate | 52.8 | 16 | 3mo ago |
| | | Success Rate (SR) | 6.7 | 15 | 2d ago |
| Classifieds | | Success Rate | 13.7 | 15 | 2d ago |
| Mind2Web Service | Mem-W-8B | Success Rate | 36.27 | 15 | 22d ago |
| MMInA Shop | Mem-W-8B | Success Rate | 48.5 | 15 | 22d ago |
| Multimodal-Mind2Web Cross-Domain | Skill-CMIB | Element Accuracy | 57.4 | 15 | 22d ago |
| Multimodal-Mind2Web Cross-Website | Explorer-7B | Element Accuracy | 60.5 | 15 | 22d ago |
| Multimodal-Mind2Web Cross-Task | AgentTrek-7B | Element Accuracy | 60.8 | 15 | 22d ago |
| MiniWob++ | Explorer-7B | Accuracy | 53.26 | 15 | 3mo ago |
| WebShop unseen (test) | ProxMO | Score | 87.2 | 14 | 3mo ago |
| WebVoyager (test) | | Success Rate | 87 | 14 | 3mo ago |