OSWorld

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| GUI Grounding | OSWorld-G | Average Score: 72.7 | 107 |
| GUI Grounding | OSWorld-G (test) | Element Accuracy: 78.4 | 52 |
| OS GUI Agentic Task Execution | OSWorld 361 tasks (Verified) | OS Success Rate: 79.17 | 43 |
| Computer Use | OSWorld | OS Success Rate: 67.84 | 42 |
| Computer task execution | OSWorld (verified) | Office Task Score: 64.8 | 24 |
| Grounding | OSWorld | Overall Score: 64.7 | 22 |
| Grounding | OSworld G-R | Accuracy: 76.4 | 19 |
| GUI Grounding | OSWorld G-Refine v1.0 (test) | Overall Success Rate: 75 | 17 |
| End-to-End Environment Interaction | OSWorld-Verified (test) | Pass@1: 61.4 | 16 |
| GUI Agent Task Success | OSWorld | Success Rate: 24.4 | 16 |
| Attack Success Rate (ASR) Evaluation | OSWorld (885-sample split) | Eligible Rate: 98.08 | 15 |
| End-to-end task execution | OSWorld (test) | Success Rate: 38.54 | 12 |
| Computer task execution | OSWorld 361 tasks | Overall Success Rate: 54.65 | 10 |
| Desktop UI Navigation | OSWorld 50 easy tasks 1.0 (test) | ASR: 100 | 10 |
| GUI Automation | OSWorld Verified (test) | Overall Success Rate: 61.92 | 9 |
| GUI Agent Task Completion | OSWorld 1.0 (test) | Success Rate (Chrome): 44.4 | 9 |
| GUI Navigation | OSWorld | Accuracy: 28.2 | 9 |
| Reward Modeling | OSWorld-Verified (Class-Imbalanced, Human Evaluation) 1.0 (test) | Precision: 88.5 | 7 |
| Reward Modeling | OSWorld Verified Class-Imbalanced Test Scripts 1.0 (test) | Precision: 61.9 | 7 |
| Reward Modeling | OSWorld Verified Class-Balanced Human Evaluation 1.0 (test) | Precision: 94 | 7 |
| Reward Modeling | OSWorld Verified Class-Balanced Scripts 1.0 (test) | Precision: 79.2 | 7 |
| Multimodal Agent Evaluation | OSWorld | Pearson r: 0.73 | 6 |
| Single-agent system security evaluation | OSWORLD (test) | ASR: 36.7 | 6 |
| Computer Use | OSWorld (test) | Success Rate: 42.5 | 6 |
| Reward Prediction | OSWorld Chrome | Reward Accuracy: 93.5 | 5 |
Showing 25 of 34 rows