Anthropic HH-RLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Helpfulness alignment | Anthropic hh-rlhf | Gold Reward: 3.36 | 14 |
| Preference Alignment | Anthropic-hh-rlhf (test) | LLM-as-a-Judge Helpful Score: 5.83 | 12 |
| Reward Modeling | Anthropic/hh-rlhf HH-helpful core250 | Delta RM: 0.292 | 6 |
| Response Diversity | Anthropic HH-RLHF | Preference Coverage: 82.5 | 6 |
| LLM Alignment | Anthropic HH-RLHF 2022 (test) | Win Rate: 62 | 4 |
| Preference Learning | Anthropic HH-RLHF+VI Preference (test) | Overall Accuracy: 64 | 3 |