Preference Evaluation on Anthropic-SafeRLHF

Win Rate: 41.7

πbias (rubric-based preference attack)

[Chart: Win Rate, Feb 14, 2026]
Updated 1mo ago

Evaluation Results

Date      Win Rate   Method   Links
2026.02   41.7
2026.02   34.1