WebArena

Benchmarks

Task Name | Dataset Name | Metric | SOTA Result | Trend
Web navigation and task completion | WebArena (test) | Average Task Completion | 15.45 | 137
Web navigation | WebArena | Overall Success Rate | 52.6 | 48
GUI Agent Planning and Execution | WebArena | Success Rate (Gitlab) | 70 | 32
Web Navigation | WebArena Lite | Reddit SR | 73.7 | 16
Web Agent Task Success | WebArena Overall 684 tasks | Task Success Rate (SR) | 56.3 | 15
Web Agent Task Success | WebArena Multi 29 tasks | Task Success Rate (SR) | 20.7 | 15
Web Agent Task Success | WebArena Reddit 106 tasks | Success Rate (SR) | 83 | 15
Web Agent Task Success | WebArena Gitlab 180 tasks | Task Success Rate (SR) | 46.7 | 15
Web Agent Task Success | WebArena 182 tasks (Admin) | Success Rate (SR) | 58.2 | 15
Web Agent Task Success | WebArena Shopping 187 tasks | Task Success Rate | 54 | 15
Reward Modeling | WebArena | Pairwise Accuracy | 88.43 | 13
Web Agent Task Success | WebArena | Task Success Rate (TSR) | 27.8 | 12
Web Agent Navigation | WebArena (test) | Success Rate | 53.8 | 10
Web navigation | WebArena Lite In-distribution | Avg Score (GitLab) | 60 | 9
Web Agent | WebArena-Lite OOD | Average Score (Reddit) | 0.18 | 9
Web Agent Navigation | WebArena Lite | Success Rate | 47.9 | 9
Web Agent Navigation | WebArena | Success Rate | 27.8 | 8
Human Correlation | WebArena | Pearson Correlation Coefficient (r) | 0.79 | 8
Web Navigation | WebArena self-hosted websites | Reddit SR | 43.8 | 8
Trajectory search | WebArena Lite 1.0 (test) | Shopping Success | 44.44 | 8
LLM Agent Reasoning | WebArena | Accuracy | 46.5 | 7
Autonomous Web Navigation | WebArena 812 tasks | Success Rate | 71.2 | 7
Web navigation / Agent interaction | WebArena full 812-task | Success Rate | 71.6 | 6
Attack Success Rate (ASR_B) | VisualWebArena Task A pseudo trajectories | Action Rate | 19.5 | 6
Web Navigation | WebArena Multi-site | Average Steps | 10.75 | 6
Showing 25 of 37 rows