
WebArena

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Web navigation and task completion | WebArena (test) | Average Task Completion | 15.45 | 137 |
| Web navigation | WebArena | Overall Success Rate | 52.6 | 55 |
| Web Agent | WebArena | Success Rate | 35.8 | 36 |
| GUI Agent Planning and Execution | WebArena | Success Rate (Gitlab) | 70 | 32 |
| Web Navigation | WebArena Lite | Gitlab SR | 56.7 | 24 |
| Web Navigation | WebArena Lite v2 | Average Success Rate | 28.6 | 19 |
| Web Agent Navigation | WebArena | Success Rate | 45.9 | 19 |
| Web Agent | WebArena Lite | SR | 21.2 | 18 |
| Web agent task completion | WebArena (test) | Shopping Success Rate | 42.9 | 18 |
| Web Agent Task Success | WebArena Overall 684 tasks | Task Success Rate (SR) | 56.3 | 15 |
| Web Agent Task Success | WebArena Multi 29 tasks | Task Success Rate (SR) | 20.7 | 15 |
| Web Agent Task Success | WebArena Reddit 106 tasks | Success Rate (SR) | 83 | 15 |
| Web Agent Task Success | WebArena Gitlab 180 tasks | Task Success Rate (SR) | 46.7 | 15 |
| Web Agent Task Success | WebArena 182 tasks (Admin) | Success Rate (SR) | 58.2 | 15 |
| Web Agent Task Success | WebArena Shopping 187 tasks | Task Success Rate | 54 | 15 |
| Reward Modeling | WebArena | Pairwise Accuracy | 88.43 | 13 |
| Web Agent Task Success | WebArena | Task Success Rate (TSR) | 27.8 | 12 |
| Prefix-risk ranking | WebArena (held-out) | AUPRC | 90 | 11 |
| Web Agent Navigation | WebArena (test) | Success Rate | 53.8 | 10 |
| Continual-memory deployment | WebArena | Macro-Averaged Success Rate | 52.3 | 9 |
| Web Navigation and Task Automation | WebArena Continual-memory deployment | Success Rate (SR) | 52.3 | 9 |
| Web navigation | WebArena Lite In-distribution | Avg Score (GitLab) | 60 | 9 |
| Web Agent | WebArena-Lite OOD | Average Score (Reddit) | 0.18 | 9 |
| Web Agent Navigation | WebArena Lite | Success Rate | 47.9 | 9 |
| Web Browsing Automation | WebArena (test) | Latency (s) | 3.9 | 8 |

Showing 25 of 52 rows