
OSWorld

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| GUI Grounding | OSWorld-G | Average Score: 72.7 | 144 |
| GUI Grounding | OSWorld-G (test) | Element Accuracy: 78.4 | 52 |
| Computer Use | OSWorld | OS Success Rate: 75 | 45 |
| OS GUI Agentic Task Execution | OSWorld 361 tasks (Verified) | OS Success Rate: 79.17 | 43 |
| Operating System GUI Agentic Reasoning | OSWorld | Success Rate: 64.29 | 42 |
| GUI Automation | OSWorld Verified (test) | Overall Success Rate: 61.92 | 40 |
| UI Agent Evaluation | OSWorld | SR (15 Steps): 40.3 | 34 |
| GUI Navigation | OSWorld Verified | OS Success Rate: 91.7 | 32 |
| GUI Agent Interaction | OSWorld | Average Accuracy: 42.5 | 24 |
| Computer task execution | OSWorld (verified) | Office Task Score: 64.8 | 24 |
| Grounding | OSWorld | Overall Score: 64.7 | 22 |
| GUI Agent Task Completion | OSWorld 1.0 (test) | Success Rate (GIMP): 82.05 | 20 |
| Grounding | OSworld G-R | Accuracy: 76.4 | 19 |
| Interactive Desktop Task Success | OSWorld | Chrome Success Rate: 59.91 | 18 |
| GUI Grounding | OSWorld G-Refine v1.0 (test) | Overall Success Rate: 75 | 17 |
| GUI Agent Interaction | OSWorld | Success Rate (Max Steps: 15): 42.9 | 16 |
| End-to-End Environment Interaction | OSWorld-Verified (test) | Pass@1: 61.4 | 16 |
| GUI Agent Task Success | OSWorld | Success Rate: 24.4 | 16 |
| Task accuracy | OSWorld | Task Accuracy: 41.49 | 15 |
| Multimodal Task Accuracy | OSWorld | Multimodal Task Accuracy: 41.49 | 15 |
| Attack Success Rate (ASR) Evaluation | OSWorld (885-sample split) | Eligible Rate: 98.08 | 15 |
| GUI Grounding | OSWorld-G refined annotation | Text Match: 83.5 | 14 |
| Computer Use Agent Navigation | OSWorld (Verified) | Success Rate: 78.7 | 13 |
| End-to-end task execution | OSWorld (test) | Success Rate: 38.54 | 12 |
| Computer Use | OSWorld (Verified) | Score: 75 | 12 |
Showing 25 of 53 rows