Correctness verification on GSM-Symbolic
Top LB: 0.435 (Beaver, Dec 5, 2025)
Updated 4d ago
Evaluation Results

| Method | Model | Date | LB | UB | Gap | N |
|--------|-------|------|----|----|-----|---|
| Beaver | Llama3.3-70B | 2025.12 | 0.435 | 0.454 | 0.019 | 33.33 |
| RS | Llama3.3-70B | 2025.12 | 0.43 | 0.552 | 0.122 | 59.63 |
| Beaver | Qwen3-30B-A3B | 2025.12 | 0.404 | 0.426 | 0.022 | 38.58 |
| Beaver | Qwen2.5-14B | 2025.12 | 0.395 | 0.439 | 0.044 | 51.54 |
| RS | Qwen3-30B-A3B | 2025.12 | 0.384 | 0.541 | 0.157 | 72.91 |
| RS | Qwen2.5-14B | 2025.12 | 0.356 | 0.704 | 0.348 | 85.39 |
| Beaver | Qwen3-4B | 2025.12 | 0.343 | 0.356 | 0.013 | 24.95 |
| RS | Qwen3-4B | 2025.12 | 0.341 | 0.433 | 0.092 | 49.02 |
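The page does not define the Gap column, but the reported values are consistent with Gap = UB - LB. The following is a minimal sketch that checks this assumption against the rows above; the names `ROWS`, `gap`, and `check_gaps` are hypothetical and introduced only for illustration.

```python
# Rows copied from the Evaluation Results listing:
# (method, model, LB, UB, reported Gap)
ROWS = [
    ("Beaver", "Llama3.3-70B", 0.435, 0.454, 0.019),
    ("RS", "Llama3.3-70B", 0.43, 0.552, 0.122),
    ("Beaver", "Qwen3-30B-A3B", 0.404, 0.426, 0.022),
    ("Beaver", "Qwen2.5-14B", 0.395, 0.439, 0.044),
    ("RS", "Qwen3-30B-A3B", 0.384, 0.541, 0.157),
    ("RS", "Qwen2.5-14B", 0.356, 0.704, 0.348),
    ("Beaver", "Qwen3-4B", 0.343, 0.356, 0.013),
    ("RS", "Qwen3-4B", 0.341, 0.433, 0.092),
]


def gap(lb, ub):
    """Assumed definition of the Gap column: UB minus LB, to three decimals."""
    return round(ub - lb, 3)


def check_gaps(rows, tol=5e-4):
    """Return the (method, model) pairs whose reported Gap differs from UB - LB."""
    return [
        (method, model)
        for method, model, lb, ub, reported in rows
        if abs((ub - lb) - reported) > tol
    ]


if __name__ == "__main__":
    mismatches = check_gaps(ROWS)
    print("rows where Gap != UB - LB:", mismatches)
```

Under this assumption, every row listed here agrees with its reported Gap to three decimal places, and Beaver reports a smaller Gap than RS for each of the four models.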