
HH-RLHF

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Safety Alignment | HH-RLHF | MD Rate | 1.09 | 68 |
| Helpful and Harmless Preference Reasoning | HH-RLHF | Accuracy | 54.3 | 56 |
| Preference Alignment | HH-RLHF (test) | Win Rate | 87.4 | 36 |
| Preference Alignment | HH-RLHF | BLEU | 0.275 | 31 |
| Assistant Response Alignment (Helpfulness and Harmlessness) | HH-RLHF (test) | Helpfulness Win Rate | 89.42 | 31 |
| Safety Evaluation | HH-RLHF (test) | Harm Score | 1.02 | 21 |
| LLM Alignment | HH-RLHF (test) | Win Rate | 80.3 | 21 |
| LLM Alignment | HH-RLHF 300 prompts | Win/Tie Rate vs Vanilla (GPT-4o) | 69.8 | 16 |
| RLHF | HH-RLHF | Human Win Rate | 74 | 16 |
| Best-of-N Alignment | HH-RLHF (test) | Percent batches with BWR > 0.50 | 98 | 12 |
| Alignment | HH-RLHF | Estimated Score (EST) | 154 | 12 |
| Best-of-N Alignment | HH-RLHF | BWR | 53 | 12 |
| Reward model verification | HH-RLHF | Win Rate | 47.3 | 12 |
| Harmlessness evaluation | HH-RLHF harmless (test) | Win Rate | 83.33 | 12 |
| Certified Poisoning Stability | HH-RLHF | FTS@1 | 100 | 9 |
| Dialogue generation | full-hh-rlhf (test) | Win Rate (Beaver-7b-v3.0-reward) | 79.3 | 8 |
| Helpfulness evaluation | HH-RLHF helpful (test) | Helpfulness Fraction | 77 | 7 |
| Pairwise preference comparison | HH-RLHF held-out (test) | Win Rate | 53.02 | 6 |
| Validity Certification | HH-RLHF (test) | FTV@k=1 | 100 | 6 |
| Constitutional AI Alignment | HH-RLHF (test) | Likert Score Ranking | 4.596 | 6 |
| Controllable multi-objective generation | HH-RLHF Helpful vs Harmless (test) | Hypervolume | 1.24 | 6 |
| HH-RLHF | HH-RLHF | Hyper-volume | 10.435 | 5 |
| Model Alignment | HH-RLHF D3 (test) | Harmlessness BLEU Score | 32.77 | 5 |
| Model Alignment | HH-RLHF D2 (test) | Harmlessness BLEU | 20.13 | 5 |
| Model Alignment | HH-RLHF 0-shot (test) | Harmlessness BLEU | 62.68 | 5 |
Showing 25 of 32 rows
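Most entries above report one of two metric families: a pairwise win rate (a judge compares the candidate model's response against a baseline's and records win/tie/loss) or a Best-of-N statistic (sample N responses, keep the one a reward model scores highest, and measure how often that pick beats the baseline). A minimal sketch of both computations, using a hypothetical toy reward function rather than any benchmark's actual judge or reward model:

```python
# Sketch of the two metric families common in HH-RLHF leaderboard rows.
# The judgments and reward function below are illustrative stand-ins,
# not the actual evaluation pipeline behind any listed result.

def win_rate(judgments):
    """Fraction of pairwise comparisons labeled 'win' for the candidate.

    `judgments` is a list of 'win' / 'tie' / 'loss' labels from a judge
    (human or LLM) comparing candidate vs. baseline responses.
    """
    return sum(1 for j in judgments if j == "win") / len(judgments)


def best_of_n(responses, reward_fn):
    """Best-of-N selection: return the response the reward model scores highest."""
    return max(responses, key=reward_fn)


if __name__ == "__main__":
    # Toy pairwise judgments: 3 wins out of 5 comparisons -> 60% win rate.
    labels = ["win", "loss", "win", "tie", "win"]
    print(f"win rate: {win_rate(labels):.1%}")

    # Hypothetical reward model that simply prefers shorter responses.
    samples = ["a long rambling reply", "concise"]
    print(best_of_n(samples, reward_fn=lambda r: -len(r)))
```

A "BWR" (best-of-N win rate) entry then reduces to `win_rate` computed over comparisons in which the candidate side is the `best_of_n` pick.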