Next-state prediction on WebShop
[Chart: EM Accuracy by date (Dec 21, 2025); highlighted point: Qwen2.5-7B, 79.05]
Updated 2d ago
Evaluation Results
Method              Evaluation Protocol   Date      EM Accuracy
Qwen2.5-7B          SFT                   2025.12   79.05
Llama3.1-8B         SFT                   2025.12   77.24
Gemini-2.5-flash    Fe...                 2025.12   66.09
GPT-5               Fe...                 2025.12   65.9
GPT-4o              Fe...                 2025.12   64.62
GPT-4.1             Fe...                 2025.12   64.23
GPT-4-turbo         Fe...                 2025.12   62.76
GPT-4o-mini         Fe...                 2025.12   61.93
Claude-sonnet-4.5   Ze...                 2025.12   58.8
GPT-4o              Ze...                 2025.12   58.2
GPT-4.1             Ze...                 2025.12   58.07
Gemini-2.5-flash    Ze...                 2025.12   57.64
Claude-sonnet-4.5   Fe...                 2025.12   56.65
GPT-4o-mini         Ze...                 2025.12   56.59
GPT-4-turbo         Ze...                 2025.12   52.45
GPT-5               Ze...                 2025.12   46.12
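
The scores above are reported as EM (exact match) accuracy. As a rough illustration of how such a metric is typically computed, here is a minimal sketch. It assumes a prediction counts as correct only when it equals the reference string after stripping surrounding whitespace; the benchmark's actual normalization is not stated on this page and may differ. The function name `exact_match_accuracy` and the sample strings are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions that exactly equal their reference.

    Assumption: comparison is plain string equality after stripping
    leading/trailing whitespace. The benchmark's own rule may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)


# Hypothetical next-state strings, for illustration only.
preds = ["page: results", "page: item", "page: cart"]
refs = ["page: results", "page: item", "page: checkout"]
score = exact_match_accuracy(preds, refs)
print(f"{score:.2f}")
```

Because the match is all-or-nothing per example, a prediction that is nearly right still scores zero for that example, so EM tends to be a strict lower bound compared with partial-credit metrics.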