
Reward Hacking Mitigation on Excessive HH Harmless 1.0 (Evaluation)

Reference Error Rate: 8.2

IR3 Method B (Adversarial)

Feb 23, 2026
Updated 4d ago

Evaluation Results

Date      Reference Error Rate   (unlabeled score)
2026.02    8.2                   91.2
2026.02    8.8                   90.9
2026.02   10.8                   90.8
2026.02   11.5                   90.2
2026.02   14.5                   90.5
2026.02   16.8                   89
2026.02   18.2                   88.2
2026.02   19.5                   89.2
2026.02   20.1                   89.5
-         23.4                   89.8