Web Task Completion on MiniWoB++ Held-out Tasks (test)
[Chart: Success Rate by date. Highlighted point: Claude-3.5-Sonnet, 70.4, Jul 5, 2025]
Evaluation Results
Method                            Method details             Date     Success Rate
Claude-3.5-Sonnet                 Backbone Model=Claude-...  2025.07  70.4
GPT-4o-Mini                       Backbone Model=GPT-4o-...  2025.07  66.1
o1-Mini                           Backbone Model=o1-Mini...  2025.07  66.1
Llama-3.1-70B-Instruct            Backbone Model=Llama-3...  2025.07  65.2
Llama-3.1-405B-Instruct           Backbone Model=Llama-3...  2025.07  65.2
GPT-4o                            Backbone Model=GPT-4o,...  2025.07  64.3
Llama-3.1-8B SFT+RL (Ours)        Backbone Model=Llama-3...  2025.07  63.3
Llama-3.3-70B Instruct (Teacher)  Backbone Model=Llama-3...  2025.07  61.9
Qwen-2.5-7B SFT+RL (Ours)         Backbone Model=Qwen-2....  2025.07  61.5
Qwen-2.5-72B Instruct (Teacher)   Backbone Model=Qwen-2....  2025.07  59
Llama-3.1-8B SFT (Ours)           Backbone Model=Llama-3...  2025.07  56.7
Qwen-2.5-7B SFT (Ours)            Backbone Model=Qwen-2....  2025.07  56.5
Qwen-2.5-7B RL (Ours)             Backbone Model=Qwen-2....  2025.07  53
Llama-3.1-8B RL (Ours)            Backbone Model=Llama-3...  2025.07  43.5
Qwen-2.5-7B Instruct (Student)    Backbone Model=Qwen-2....  2025.07  37
Llama-3.1-8B Instruct (Student)   Backbone Model=Llama-3...  2025.07  36.4