Open-ended reasoning on WildBench

57.05 Creative Score

PPO w/ Structure Reward

Updated 3mo ago

Evaluation Results

Method	Links	Creative Score
PPO w/ Structure Reward 2026.03		57.05	45.87	27.7	53.47	30.05	40.26
GRPO w/ Structure Reward 2026.03		55.34	43.56	26.83	51.98	29.91	39.01
DPO 2026.03		51.16	37.82	17.06	48.37	14.34	30.34
Base 2026.03		51.01	36.23	16.35	48.71	14.72	29.91
GRPO w/ Entropy Minimization (EMPO) 2026.03		51.01	36.11	17.62	45.69	12.92	29.2