Reward Hacking Mitigation on Synthetic Goodhart 1.0 (Evaluation)

R_g: 4.38

IR3 Method B (Adversarial)

Updated 5mo ago

Evaluation Results

Method	Links	R_g	(unlabeled)
IR3 Method B (Adversarial) 2026.02		4.38	0.41
IR3 Method C (Constrained) 2026.02		4.35	0.45
IR3 Method A (Clean RL) 2026.02		4.21	0.62
IR3 Method D (Distillation) 2026.02		4.12	0.71
InfoRM 2026.02		3.95	1.08
Length Penalty 2026.02		3.85	1.32
KL Regularization 2026.02		3.78	1.42
Reward Clipping 2026.02		3.72	1.48
PPO interface clipping 2026.02		3.68	1.55
PPO on R_proxy 2026.02		3.52	1.86