Task Performance on GDPVal 44 tasks (held-out)
[Chart: Mean Return by date. Top entry: Oracle Rubric (Baseline), Mean Return 0.81, Dec 5, 2025. Updated 4d ago.]
Evaluation Results
Method                     N (Number of rollouts)   Date      Mean Return
Oracle Rubric (Baseline)   8                        2025.12   0.81
GSPO Model                 8                        2025.12   0.74
Oracle Rubric (Baseline)   1                        2025.12   0.70
SFT Model                  8                        2025.12   0.68
GSPO Model                 1                        2025.12   0.62
SFT Model                  1                        2025.12   0.59
No Rubric (LLM-Judged)     1                        2025.12   0.58
No Rubric (LLM-Judged)     8                        2025.12   0.58
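
To make the effect of the rollout count easier to read off the results above, here is a minimal Python sketch that groups the listed scores by method and computes the difference in Mean Return between N=8 and N=1. The row values are transcribed from the leaderboard; the names `RESULTS` and `rollout_gain` are hypothetical helpers introduced for illustration, and the page does not state how rollouts are aggregated into a single score, so the sketch only compares the reported numbers.

```python
# Rows transcribed from the leaderboard: (method, n_rollouts, mean_return).
# RESULTS and rollout_gain are illustrative names, not part of the benchmark.
RESULTS = [
    ("Oracle Rubric (Baseline)", 8, 0.81),
    ("GSPO Model", 8, 0.74),
    ("Oracle Rubric (Baseline)", 1, 0.70),
    ("SFT Model", 8, 0.68),
    ("GSPO Model", 1, 0.62),
    ("SFT Model", 1, 0.59),
    ("No Rubric (LLM-Judged)", 1, 0.58),
    ("No Rubric (LLM-Judged)", 8, 0.58),
]


def rollout_gain(results, low=1, high=8):
    """Return {method: score at N=high minus score at N=low}, rounded to 2 places."""
    by_method = {}
    for method, n, score in results:
        by_method.setdefault(method, {})[n] = score
    return {
        method: round(scores[high] - scores[low], 2)
        for method, scores in by_method.items()
        if low in scores and high in scores
    }


gains = rollout_gain(RESULTS)
# Print methods from largest to smallest gain.
for method, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{method}: {gain:+.2f}")
```

On the listed numbers, the GSPO Model gains 0.12, the Oracle Rubric (Baseline) 0.11, the SFT Model 0.09, and the No Rubric (LLM-Judged) setting is unchanged at 0.58.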