Web Navigation and Automation on WorkArena Held-out Goals (test)
[Chart: Success Rate over time. Top result: o1-Mini, 53.8 (Jul 5, 2025).]
Evaluation Results
Method                            Backbone Model  Date     Success Rate
o1-Mini                           o1-Mini         2025.07  53.8
Claude-3.5-Sonnet                 Claude-...      2025.07  52.5
GPT-4o                            GPT-4o          2025.07  42.1
Llama-3.1-405B-Instruct           Llama-3...      2025.07  39.2
Llama-3.3-70B Instruct (Teacher)  Llama-3...      2025.07  36
Llama-3.1-8B SFT+RL (Ours)        Llama-3...      2025.07  35.4
Qwen-2.5-72B Instruct (Teacher)   Qwen-2....      2025.07  33.3
Qwen-2.5-7B SFT+RL (Ours)         Qwen-2....      2025.07  32.5
Llama-3.1-8B SFT (Ours)           Llama-3...      2025.07  28.4
GPT-4o-Mini                       GPT-4o-...      2025.07  27.1
Llama-3.1-70B-Instruct            Llama-3...      2025.07  25
Qwen-2.5-7B SFT (Ours)            Qwen-2....      2025.07  25
Llama-3.1-8B Instruct (Student)   Llama-3...      2025.07  8.3
Llama-3.1-8B RL (Ours)            Llama-3...      2025.07  8
Qwen-2.5-7B Instruct (Student)    Qwen-2....      2025.07  5.2
Qwen-2.5-7B RL (Ours)             Qwen-2....      2025.07  4.5