Web Navigation and Automation on WorkArena Held-out Tasks (test)
[Chart: Success Rate over time. Highlighted point: Claude-3.5-Sonnet, 70 (Jul 5, 2025).]
Updated 4d ago
Evaluation Results
| Method                          | Backbone Model | Date    | Success Rate |
|---------------------------------|----------------|---------|--------------|
| Claude-3.5-Sonnet               | Claude-...     | 2025.07 | 70           |
| o1-Mini                         | o1-Mini        | 2025.07 | 68.6         |
| Llama-3.1-405B-Instruct         | Llama-3...     | 2025.07 | 58.6         |
| GPT-4o                          | GPT-4o         | 2025.07 | 55.7         |
| Llama-3.3-70B Instruct (Teacher) | Llama-3...    | 2025.07 | 44           |
| Llama-3.1-70B-Instruct          | Llama-3...     | 2025.07 | 32.9         |
| Llama-3.1-8B SFT+RL (Ours)      | Llama-3...     | 2025.07 | 28.8         |
| GPT-4o-Mini                     | GPT-4o-...     | 2025.07 | 28.6         |
| Qwen-2.5-72B Instruct (Teacher) | Qwen-2....     | 2025.07 | 27           |
| Llama-3.1-8B SFT (Ours)         | Llama-3...     | 2025.07 | 26.4         |
| Qwen-2.5-7B SFT+RL (Ours)       | Qwen-2....     | 2025.07 | 25           |
| Qwen-2.5-7B SFT (Ours)          | Qwen-2....     | 2025.07 | 21           |
| Llama-3.1-8B RL (Ours)          | Llama-3...     | 2025.07 | 11.5         |
| Qwen-2.5-7B Instruct (Student)  | Qwen-2....     | 2025.07 | 10           |
| Qwen-2.5-7B RL (Ours)           | Qwen-2....     | 2025.07 | 7            |
| Llama-3.1-8B Instruct (Student) | Llama-3...     | 2025.07 | 4.2          |