Reward Hacking Mitigation on Excessive HH Harmless 1.0 (Evaluation)
[Chart: Reference Error Rate and Safety Score by method. Highlighted entry: IR3 Method B (Adversarial), Reference Error Rate 8.2, Feb 23, 2026.]
Evaluation Results

Method                        Setting            Date     Reference Error Rate  Safety Score
IR3 Method B (Adversarial)    type=Adversarial   2026.02  8.2                   91.2
IR3 Method C (Constrained)    type=Constrained   2026.02  8.8                   90.9
IR3 Method A (Clean RL)       type=Clean RL      2026.02  10.8                  90.8
IR3 Method D (Distillation)   type=Distillation  2026.02  11.5                  90.2
InfoRM                        -                  2026.02  14.5                  90.5
Length Penalty                alpha=5e-4         2026.02  16.8                  89
KL Regularization             beta=2e-2          2026.02  18.2                  88.2
Reward Clipping               c=4                2026.02  19.5                  89.2
PPO Clipping                  epsilon=0.1        2026.02  20.1                  89.5
PPO on R_proxy                -                  2026.02  23.4                  89.8