Next-state prediction on TextWorld (TW)
Top result: Qwen2.5-7B, 70.6 EM Accuracy (Dec 21, 2025)
Evaluation Results
| Method | Evaluation Protocol | Date | EM Accuracy |
|---|---|---|---|
| Qwen2.5-7B | SFT | 2025.12 | 70.6 |
| Llama3.1-8B | SFT | 2025.12 | 70.45 |
| Claude-sonnet-4.5 | Fe... | 2025.12 | 49.12 |
| GPT-5 | Fe... | 2025.12 | 44.27 |
| Gemini-2.5-flash | Fe... | 2025.12 | 40.35 |
| Claude-sonnet-4.5 | Ze... | 2025.12 | 17.7 |
| GPT-4o | Fe... | 2025.12 | 14.11 |
| GPT-4.1 | Fe... | 2025.12 | 13.39 |
| GPT-4-turbo | Fe... | 2025.12 | 11.66 |
| GPT-4o-mini | Fe... | 2025.12 | 11.43 |
| GPT-5 | Ze... | 2025.12 | 9.2 |
| GPT-4o | Ze... | 2025.12 | 7.86 |
| Gemini-2.5-flash | Ze... | 2025.12 | 3.51 |
| GPT-4o-mini | Ze... | 2025.12 | 0.36 |
| GPT-4-turbo | Ze... | 2025.12 | 0 |
| GPT-4.1 | Ze... | 2025.12 | 0 |