Next-state prediction on SciWorld
[Chart: EM Accuracy over time. Top result: Llama3.1-8B, 98.64 (Dec 21, 2025)]
Evaluation Results

Method             Evaluation Protocol  Date     EM Accuracy
Llama3.1-8B        SFT                  2025.12  98.64
Qwen2.5-7B         SFT                  2025.12  98.6
Claude-sonnet-4.5  Fe...                2025.12  73.08
Gemini-2.5-flash   Fe...                2025.12  61.2
Claude-sonnet-4.5  Ze...                2025.12  56.83
GPT-4o-mini        Fe...                2025.12  56.26
GPT-4.1            Fe...                2025.12  51.56
GPT-4-turbo        Fe...                2025.12  50.08
GPT-5              Fe...                2025.12  49.44
GPT-4o             Fe...                2025.12  48.98
GPT-4o             Ze...                2025.12  45.78
Gemini-2.5-flash   Ze...                2025.12  44.81
GPT-4o-mini        Ze...                2025.12  40.68
GPT-4.1            Ze...                2025.12  35.65
GPT-4-turbo        Ze...                2025.12  34.14
GPT-5              Ze...                2025.12  13.06
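
For readers unfamiliar with the metric, EM (exact match) accuracy is usually the percentage of predictions that are identical to their reference. The sketch below is a minimal illustration under assumptions: the function name `exact_match_accuracy`, the whitespace trimming, and the sample strings are all hypothetical, and the benchmark's actual normalization rules are not stated on this page.

```python
def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions equal to their reference.

    Assumption: only leading/trailing whitespace is ignored. The real
    benchmark may normalize differently (not stated in the source).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    # Count pairs that match exactly after trimming whitespace.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical next-state strings, for illustration only.
predicted = ["door is open", "water is boiling", "lamp is off"]
expected = ["door is open", "water is frozen", "lamp is off"]
score = exact_match_accuracy(predicted, expected)  # 2 of 3 match
```

Under this definition a single differing token makes the whole prediction count as wrong, so scores can be sensitive to output formatting as well as to model quality.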