Lemma Judging on DeepTheorem sampled perturbation (test)
Top result: Llama-3.1 (RULES), 90.9 Exact Match Accuracy
[Chart: Exact Match Accuracy over time; x-axis date shown: Feb 1, 2026]
Updated 4d ago
Evaluation Results
| Method | Optimization | Date | Exact Match Accuracy |
|---|---|---|---|
| Llama-3.1 (RULES) | RULES | 2026.02 | 90.9 |
| Llama-3.1 (vanilla) | vanilla | 2026.02 | 89.2 |
| Qwen2.5 (RULES) | RULES | 2026.02 | 87.8 |
| Qwen2.5 (GRPO) | GRPO | 2026.02 | 81.2 |
| Llama-3.1 (GRPO) | GRPO | 2026.02 | 80.9 |
| Qwen2.5 (vanilla) | vanilla | 2026.02 | 71.9 |
| OLMO2 (RULES) | RULES | 2026.02 | 60.5 |
| OLMO2 (vanilla) | vanilla | 2026.02 | 47.6 |
| DeepMath (RULES) | RULES | 2026.02 | 47 |
| OLMO2 (GRPO) | GRPO | 2026.02 | 46.8 |
| DeepMath (GRPO) | GRPO | 2026.02 | 35.6 |
| DeepMath (vanilla) | vanilla | 2026.02 | 24 |