
OSWorld

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| GUI Grounding | OSWorld-G | Average Score: 72.7 | 74 |
| GUI Grounding | OSWorld-G (test) | Element Accuracy: 78.4 | 52 |
| Computer task execution | OSWorld (verified) | Office Task Score: 64.8 | 24 |
| Computer Use | OSWorld | OS Success Rate: 42.9 | 22 |
| OS GUI Agentic Task Execution | OSWorld 361 tasks (Verified) | Average Success Rate: 65.84 | 21 |
| Grounding | OSworld G-R | Accuracy: 76.4 | 19 |
| GUI Grounding | OSWorld G-Refine v1.0 (test) | Overall Success Rate: 75 | 17 |
| End-to-End Environment Interaction | OSWorld-Verified (test) | Pass@1: 61.4 | 16 |
| Desktop UI Navigation | OSWorld 50 easy tasks 1.0 (test) | ASR: 100 | 10 |
| GUI Automation | OSWorld Verified (test) | Overall Success Rate: 61.92 | 9 |
| GUI Agent Task Completion | OSWorld 1.0 (test) | Success Rate (Chrome): 44.4 | 9 |
| GUI Navigation | OSWorld | Accuracy: 28.2 | 9 |
| GUI Agent Task Success | OSWorld | Success Rate: 0.601 | 8 |
| Reward Modeling | OSWorld-Verified (Class-Imbalanced, Human Evaluation) 1.0 (test) | Precision: 88.5 | 7 |
| Reward Modeling | OSWorld Verified Class-Imbalanced Test Scripts 1.0 (test) | Precision: 61.9 | 7 |
| Reward Modeling | OSWorld Verified Class-Balanced Human Evaluation 1.0 (test) | Precision: 94 | 7 |
| Reward Modeling | OSWorld Verified Class-Balanced Scripts 1.0 (test) | Precision: 79.2 | 7 |
| Single-agent system security evaluation | OSWORLD (test) | ASR: 36.7 | 6 |
| Computer Use | OSWorld (test) | Success Rate: 42.5 | 6 |
| Computer Use | OSWorld (Verified) | Score: 66.3 | 5 |
| Desktop operating system task execution | OSWorld | AUV: 20.3 | 5 |
| GUI Agent Interaction | OSWorld w/o Loop | AUV: 29.5 | 5 |
| GUI Agent Interaction | OSWorld w/ Loop | AUV: 6.9 | 5 |
Showing 23 of 23 rows