Reward Hacking Mitigation on Synthetic Goodhart 1.0 (Evaluation)
[Chart: R_g and Gap by method. Highlighted point: IR3 Method B (Adversarial), R_g = 4.38, Feb 23, 2026.]
Evaluation Results
| Method | Setting | Date | R_g | Gap |
|---|---|---|---|---|
| IR3 Method B (Adversarial) | type=Adversarial | 2026.02 | 4.38 | 0.41 |
| IR3 Method C (Constrained) | type=Constrained | 2026.02 | 4.35 | 0.45 |
| IR3 Method A (Clean RL) | type=Clean RL | 2026.02 | 4.21 | 0.62 |
| IR3 Method D (Distillation) | type=Distillation | 2026.02 | 4.12 | 0.71 |
| InfoRM | | 2026.02 | 3.95 | 1.08 |
| Length Penalty | alpha=5e-4 | 2026.02 | 3.85 | 1.32 |
| KL Regularization | beta=2e-2 | 2026.02 | 3.78 | 1.42 |
| Reward Clipping | c=4 | 2026.02 | 3.72 | 1.48 |
| PPO interface clipping | epsilon=0.1 | 2026.02 | 3.68 | 1.55 |
| PPO on R_proxy | | 2026.02 | 3.52 | 1.86 |
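The leaderboard lists hyperparameters for the baseline rows (alpha, beta, c, epsilon) but does not say how each one is applied. The sketch below shows how those values would enter under the common textbook forms. It assumes a linear per-token length penalty, a single-sample KL penalty against a reference policy, symmetric clamping of the proxy reward to [-c, c], and the standard PPO clipped surrogate for the "PPO interface clipping" row. These functional forms, the function names, and the sample inputs are hypothetical illustrations, not the benchmark's own code.

```python
import math

# Hyperparameter values copied from the leaderboard rows; the functional
# forms below are ASSUMED common choices, not confirmed by the benchmark.
ALPHA_LENGTH = 5e-4   # Length Penalty: alpha
BETA_KL = 2e-2        # KL Regularization: beta
CLIP_C = 4.0          # Reward Clipping: c
PPO_EPSILON = 0.1     # PPO clipping: epsilon


def length_penalized(r_proxy: float, num_tokens: int, alpha: float = ALPHA_LENGTH) -> float:
    """Subtract a linear per-token penalty from the proxy reward (assumed form)."""
    return r_proxy - alpha * num_tokens


def kl_regularized(r_proxy: float, logp_policy: float, logp_ref: float,
                   beta: float = BETA_KL) -> float:
    """Subtract beta times a single-sample estimate of KL(policy || reference)."""
    return r_proxy - beta * (logp_policy - logp_ref)


def clipped_reward(r_proxy: float, c: float = CLIP_C) -> float:
    """Clamp the proxy reward to [-c, c] (assumed symmetric clipping)."""
    return max(-c, min(c, r_proxy))


def ppo_clipped_objective(logp_new: float, logp_old: float, advantage: float,
                          epsilon: float = PPO_EPSILON) -> float:
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    r = 6.0  # hypothetical proxy reward for one sampled response
    print(length_penalized(r, num_tokens=400))
    print(kl_regularized(r, logp_policy=-120.0, logp_ref=-150.0))
    print(clipped_reward(r))
    print(ppo_clipped_objective(logp_new=-1.0, logp_old=-1.2, advantage=2.0))
```

In this sketch the first three baselines reshape the proxy reward before it reaches the optimizer, while PPO clipping leaves the reward alone and instead limits how far a single update can move the policy ratio away from 1.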