HH-RLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Safety Alignment | HH-RLHF | MD Rate: 1.09 | 68 |
| Helpful and Harmless Preference Reasoning | HH-RLHF | Accuracy: 54.3 | 56 |
| Preference Alignment | HH-RLHF (test) | Win Rate: 87.4 | 36 |
| Preference Alignment | HH-RLHF | ASR: 99.4 | 32 |
| Assistant Response Alignment (Helpfulness and Harmlessness) | HH-RLHF (test) | Helpfulness Win Rate: 89.42 | 31 |
| Preference Modeling | HH-RLHF | Accuracy: 61.4 | 30 |
| LLM Alignment | HH-RLHF (test) | Diversity: 0.87 | 23 |
| Question Answering | HH-RLHF | Accuracy: 59 | 22 |
| Safety Evaluation | HH-RLHF (test) | Harm Score: 1.02 | 21 |
| Helpful Dialogue | Anthropic HH-RLHF helpful core250 (test) | Reward Score: 18.93 | 18 |
| LLM Judgement Confidence Estimation | HH-RLHF (test) | RK: 0.4763 | 16 |
| LLM Alignment | HH-RLHF 300 prompts | Win/Tie Rate vs Vanilla (GPT-4o): 69.8 | 16 |
| RLHF | HH-RLHF | Human Win Rate: 74 | 16 |
| RLHF Alignment | HH-RLHF (held-out) | Win Rate: 78 | 14 |
| LLM-as-a-judge | HH-RLHF | Coverage: 81.3 | 12 |
| Reward Modeling | HH-RLHF helpful core250 (held-out evaluation) | Reward Score: 20.155 | 12 |
| Best-of-N Alignment | HH-RLHF (test) | Percent batches with BWR > 0.5098 | 12 |
| Alignment | HH-RLHF | Estimated Score (EST): 154 | 12 |
| Best-of-N Alignment | HH-RLHF | BWR: 53 | 12 |
| Reward model verification | HH-RLHF | Win Rate: 47.3 | 12 |
| Harmlessness evaluation | HH-RLHF harmless (test) | Win Rate: 83.33 | 12 |
| Confidence Estimation | HH-RLHF | Rank Correlation (RK): 0.4718 | 11 |
| Helpful Assistant | HH-RLHF | HV Score: 9.08 | 10 |
| RLHF | HH-RLHF (held-out) | Peak Gold Reward: 1.59 | 9 |
| Certified Poisoning Stability | HH-RLHF | FTS@1100 | 9 |
Showing 25 of 47 rows