| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| GUI Grounding | OSWorld-G | Average Score | 72.7 | 74 |
| GUI Grounding | OSWorld-G (test) | Element Accuracy | 78.4 | 52 |
| Computer task execution | OSWorld (verified) | Office Task Score | 64.8 | 24 |
| Computer Use | OSWorld | OS Success Rate | 42.9 | 22 |
| OS GUI Agentic Task Execution | OSWorld 361 tasks (Verified) | Average Success Rate | 65.84 | 21 |
| Grounding | OSworld G-R | Accuracy | 76.4 | 19 |
| GUI Grounding | OSWorld G-Refine v1.0 (test) | Overall Success Rate | 75 | 17 |
| End-to-End Environment Interaction | OSWorld-Verified (test) | Pass@1 | 61.4 | 16 |
| Desktop UI Navigation | OSWorld 50 easy tasks 1.0 (test) | ASR | 100 | 10 |
| GUI Automation | OSWorld Verified (test) | Overall Success Rate | 61.92 | 9 |
| GUI Agent Task Completion | OSWorld 1.0 (test) | Success Rate (Chrome) | 44.4 | 9 |
| GUI Navigation | OSWorld | Accuracy | 28.2 | 9 |
| GUI Agent Task Success | OSWorld | Success Rate | 0.601 | 8 |
| Reward Modeling | OSWorld-Verified (Class-Imbalanced, Human Evaluation) 1.0 (test) | Precision | 88.5 | 7 |
| Reward Modeling | OSWorld Verified Class-Imbalanced Test Scripts 1.0 (test) | Precision | 61.9 | 7 |
| Reward Modeling | OSWorld Verified Class-Balanced Human Evaluation 1.0 (test) | Precision | 94 | 7 |
| Reward Modeling | OSWorld Verified Class-Balanced Scripts 1.0 (test) | Precision | 79.2 | 7 |
| Single-agent system security evaluation | OSWORLD (test) | ASR | 36.7 | 6 |
| Computer Use | OSWorld (test) | Success Rate | 42.5 | 6 |
| Computer Use | OSWorld (Verified) | Score | 66.3 | 5 |
| Desktop operating system task execution | OSWorld | AUV | 20.3 | 5 |
| GUI Agent Interaction | OSWorld w/o Loop | AUV | 29.5 | 5 |
| GUI Agent Interaction | OSWorld w/ Loop | AUV | 6.9 | 5 |