Web Task Completion on MiniWoB++ Held-out Goals (test)
[Chart: Success Rate by date (Jul 5, 2025); best result: Claude-3.5-Sonnet, 70.5]
Evaluation Results
Method | Configuration | Date | Success Rate
--- | --- | --- | ---
Claude-3.5-Sonnet | Backbone Model=Claude-... | 2025.07 | 70.5
o1-Mini | Backbone Model=o1-Mini... | 2025.07 | 69.7
Llama-3.1-8B SFT+RL (Ours) | Backbone Model=Llama-3... | 2025.07 | 67.2
Llama-3.1-405B-Instruct | Backbone Model=Llama-3... | 2025.07 | 65.9
GPT-4o | Backbone Model=GPT-4o,... | 2025.07 | 65.7
Qwen-2.5-7B SFT+RL (Ours) | Backbone Model=Qwen-2.... | 2025.07 | 65.0
Llama-3.3-70B Instruct (Teacher) | Backbone Model=Llama-3... | 2025.07 | 63.2
Qwen-2.5-72B Instruct (Teacher) | Backbone Model=Qwen-2.... | 2025.07 | 61.0
Llama-3.1-70B-Instruct | Backbone Model=Llama-3... | 2025.07 | 57.0
Qwen-2.5-7B SFT (Ours) | Backbone Model=Qwen-2.... | 2025.07 | 57.0
GPT-4o-Mini | Backbone Model=GPT-4o-... | 2025.07 | 56.2
Llama-3.1-8B SFT (Ours) | Backbone Model=Llama-3... | 2025.07 | 55.2
Qwen-2.5-7B RL (Ours) | Backbone Model=Qwen-2.... | 2025.07 | 52.5
Llama-3.1-8B RL (Ours) | Backbone Model=Llama-3... | 2025.07 | 43.5
Qwen-2.5-7B Instruct (Student) | Backbone Model=Qwen-2.... | 2025.07 | 32.8
Llama-3.1-8B Instruct (Student) | Backbone Model=Llama-3... | 2025.07 | 29.5
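As a worked example of reading the table, the sketch below computes how many points each training stage (SFT, RL, SFT+RL) adds over the same backbone's untuned Instruct (Student) baseline. All scores are copied from the rows of the table; the dictionary layout and the helper name `gains_over_student` are hypothetical, chosen only for illustration, and are not part of the benchmark.

```python
# Success rates copied from the table (MiniWoB++ held-out goals, test).
# The dict layout and stage labels are illustrative, not a benchmark API.
SCORES = {
    "Llama-3.1-8B": {"Instruct (Student)": 29.5, "RL": 43.5, "SFT": 55.2, "SFT+RL": 67.2},
    "Qwen-2.5-7B": {"Instruct (Student)": 32.8, "RL": 52.5, "SFT": 57.0, "SFT+RL": 65.0},
}

BASELINE = "Instruct (Student)"


def gains_over_student(stages: dict[str, float]) -> dict[str, float]:
    """Return each stage's success-rate gain over the untuned student, in points."""
    base = stages[BASELINE]
    # Round to one decimal to match the table's precision and hide float noise.
    return {name: round(score - base, 1) for name, score in stages.items() if name != BASELINE}


if __name__ == "__main__":
    for backbone, stages in SCORES.items():
        print(backbone, gains_over_student(stages))
```

Computed this way, SFT+RL gives the largest gain on both backbones (+37.7 points for Llama-3.1-8B, +32.2 for Qwen-2.5-7B), and RL alone gives the smallest.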