Preference evaluation on Anthropic-SafeRLHF benchmark

Win Rate: 33.7

πbias (rubric-based preference attack)

[Chart: Win Rate, y-axis ticks 23.508, 26.154, 28.8, 31.446; x-axis date Feb 14, 2026]
Updated 4d ago

Evaluation Results

Date      Win Rate
2026.02   33.7
2026.02   23.9