HH-RLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Safety Alignment | HH-RLHF | MD Rate: 1.09 | 36 |
| Assistant Response Alignment (Helpfulness and Harmlessness) | HH-RLHF (test) | Helpfulness Win Rate: 89.42 | 31 |
| Safety Evaluation | HH-RLHF (test) | Harm Score: 1.02 | 21 |
| LLM Alignment | HH-RLHF (test) | Win Rate: 80.3 | 21 |
| RLHF | HH-RLHF | Human Win Rate: 74 | 16 |
| Reward model verification | HH-RLHF | Win Rate: 47.3 | 12 |
| Harmlessness evaluation | HH-RLHF harmless (test) | Win Rate: 83.33 | 12 |
| Certified Poisoning Stability | HH-RLHF | FTS@1100 | 9 |
| Dialogue generation | full-hh-rlhf (test) | Win Rate (Beaver-7b-v3.0-reward): 79.3 | 8 |
| Helpfulness evaluation | HH-RLHF helpful (test) | Helpfulness Fraction: 77 | 7 |
| Validity Certification | HH-RLHF (test) | FTV@k=1100 | 6 |
| Constitutional AI Alignment | HH-RLHF (test) | Likert Score Ranking: 4.596 | 6 |
| Controllable multi-objective generation | HH-RLHF Helpful vs Harmless (test) | Hypervolume: 1.24 | 6 |
| Humor | HH-RLHF (test) | Reward: 2.481 | 4 |
| Harmlessness | HH-RLHF (test) | Reward: 2.772 | 4 |
| Helpfulness | HH-RLHF (test) | Reward: 2.542 | 4 |
| Controllable multi-objective generation | HH-RLHF Helpful vs Humor (test) | Hypervolume: 1.24 | 4 |
| Conversational Assistant | HH-RLHF | Reward: 0.5 | 3 |