Preference Evaluation on Anthropic-SafeRLHF

41.7 Win Rate

πbias (rubric-based preference attack)

Updated 4mo ago

Evaluation Results

Method	Date	Links	Win Rate
πbias (rubric-based preference attack)	2026.02		41.7
πbias (rubric-based preference attack)	2026.02		34.1