Lemma Judging on DeepTheorem sampled perturbation (test)

90.9 Exact Match Accuracy

Llama-3.1 (RULES)

Updated 5mo ago

Evaluation Results

Method	Date	Exact Match Accuracy
Llama-3.1 (RULES)	2026.02	90.9
Llama-3.1 (vanilla)	2026.02	89.2
Qwen2.5 (RULES)	2026.02	87.8
Qwen2.5 (GRPO)	2026.02	81.2
Llama-3.1 (GRPO)	2026.02	80.9
Qwen2.5 (vanilla)	2026.02	71.9
OLMO2 (RULES)	2026.02	60.5
OLMO2 (vanilla)	2026.02	47.6
DeepMath (RULES)	2026.02	47
OLMO2 (GRPO)	2026.02	46.8
DeepMath (GRPO)	2026.02	35.6
DeepMath (vanilla)	2026.02	24