Mathematical Reasoning on Minerva (avg@8)

Top score: 62.45 avg@8 (GRPO w/ Ground-Truth)

Updated 3mo ago

Evaluation Results

Method	Date	Avg@8
GRPO w/ Ground-Truth	2026.03	62.45
GRPO w/ Structure Reward	2026.03	61.99
GRPO w/ Majority Voting (TTRL)	2026.03	61.86
GRPO w/ Entropy Minimization (EMPO)	2026.03	61.44
PPO w/ Structure Reward	2026.03	61.08
PPO w/ Ground-Truth	2026.03	59.74
Base	2026.03	53.45
Qwen2.5-7B-Instruct	2025.08	46.83
Qwen2.5-7B-Instruct	2025.08	45.27
Qwen2.5-7B-Instruct	2025.08	43.66
Qwen2.5-7B-Instruct	2025.08	43.33
Llama3.1-8B-Instruct	2025.08	37.55
Llama3.1-8B-Instruct	2025.08	33.09
Llama3.1-8B-Instruct	2025.08	32.40
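
The page does not define avg@8. A minimal sketch of one common reading follows, assuming the score is the mean accuracy over 8 sampled responses per problem, averaged across problems and reported as a percentage. The function name `avg_at_k` and the input layout are hypothetical and are not taken from the benchmark's own evaluation code.

```python
from statistics import mean


def avg_at_k(correct, k=8):
    """Return avg@k as a percentage.

    `correct` is a list with one entry per problem; each entry is a list
    of k booleans saying whether each sampled response was judged correct.
    (Assumed layout for illustration only.)
    """
    per_problem = []
    for samples in correct:
        if len(samples) != k:
            raise ValueError(f"expected {k} samples per problem, got {len(samples)}")
        # Fraction of the k samples that were correct for this problem.
        per_problem.append(sum(samples) / k)
    # Average the per-problem accuracies and scale to a percentage.
    return 100.0 * mean(per_problem)


if __name__ == "__main__":
    # Two toy problems: one always solved, one solved in 4 of 8 samples.
    toy = [[True] * 8, [True] * 4 + [False] * 4]
    print(f"avg@8 = {avg_at_k(toy):.2f}")
```

Under this reading, a score such as 62.45 would mean that, on average, about 62% of the sampled responses per problem were judged correct.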