| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| GUI Grounding | OSWorld-G | Average Score 72.7 | 107 | |
| GUI Grounding | OSWorld-G (test) | Element Accuracy 78.4 | 52 | |
| OS GUI Agentic Task Execution | OSWorld 361 tasks (Verified) | OS Success Rate 79.17 | 43 | |
| Computer Use | OSWorld | OS Success Rate 67.84 | 42 | |
| Computer task execution | OSWorld (verified) | Office Task Score 64.8 | 24 | |
| Grounding | OSWorld | Overall Score 64.7 | 22 | |
| Grounding | OSworld G-R | Accuracy 76.4 | 19 | |
| GUI Grounding | OSWorld G-Refine v1.0 (test) | Overall Success Rate 75 | 17 | |
| End-to-End Environment Interaction | OSWorld-Verified (test) | Pass@1 61.4 | 16 | |
| GUI Agent Task Success | OSWorld | Success Rate 24.4 | 16 | |
| Attack Success Rate (ASR) Evaluation | OSWorld (885-sample split) | Eligible Rate 98.08 | 15 | |
| End-to-end task execution | OSWorld (test) | Success Rate 38.54 | 12 | |
| Computer task execution | OSWorld 361 tasks | Overall Success Rate 54.65 | 10 | |
| Desktop UI Navigation | OSWorld 50 easy tasks 1.0 (test) | ASR 100 | 10 | |
| GUI Automation | OSWorld Verified (test) | Overall Success Rate 61.92 | 9 | |
| GUI Agent Task Completion | OSWorld 1.0 (test) | Success Rate (Chrome) 44.4 | 9 | |
| GUI Navigation | OSWorld | Accuracy 28.2 | 9 | |
| Reward Modeling | OSWorld-Verified (Class-Imbalanced, Human Evaluation) 1.0 (test) | Precision 88.5 | 7 | |
| Reward Modeling | OSWorld Verified Class-Imbalanced Test Scripts 1.0 (test) | Precision 61.9 | 7 | |
| Reward Modeling | OSWorld Verified Class-Balanced Human Evaluation 1.0 (test) | Precision 94 | 7 | |
| Reward Modeling | OSWorld Verified Class-Balanced Scripts 1.0 (test) | Precision 79.2 | 7 | |
| Multimodal Agent Evaluation | OSWorld | Pearson r 0.73 | 6 | |
| Single-agent system security evaluation | OSWORLD (test) | ASR 36.7 | 6 | |
| Computer Use | OSWorld (test) | Success Rate 42.5 | 6 | |
| Reward Prediction | OSWorld Chrome | Reward Accuracy 93.5 | 5 | |