HH-RLHF

Benchmarks

Task Name	Dataset Name	SOTA Result
Safety Alignment	HH-RLHF	MD Rate: 1.09	68
Helpful and Harmless Preference Reasoning	HH-RLHF	Accuracy: 54.3	56
Preference Alignment	HH-RLHF (test)	Win Rate: 87.4	36
Preference Alignment	HH-RLHF	ASR: 99.4	32
Assistant Response Alignment (Helpfulness and Harmlessness)	HH-RLHF (test)	Helpfulness Win Rate: 89.42	31
Preference Modeling	HH-RLHF	Accuracy: 61.4	30
LLM Alignment	HH-RLHF (test)	Diversity: 0.87	23
Question Answering	HH-RLHF	Accuracy: 59	22
Safety Evaluation	HH-RLHF (test)	Harm Score: 1.02	21
Helpful Dialogue	Anthropic HH-RLHF helpful core250 (test)	Reward Score: 18.93	18
LLM Judgement Confidence Estimation	HH-RLHF (test)	RK: 0.4763	16
LLM Alignment	HH-RLHF 300 prompts	Win/Tie Rate vs Vanilla (GPT-4o): 69.8	16
RLHF	HH-RLHF	Human Win Rate: 74	16
RLHF Alignment	HH-RLHF (held-out)	Win Rate: 78	14
LLM-as-a-judge	HH-RLHF	Coverage: 81.3	12
Reward Modeling	HH-RLHF helpful core250 (held-out evaluation)	Reward Score: 20.155	12
Best-of-N Alignment	HH-RLHF (test)	Percent batches with BWR > 0.50: 98	12
Alignment	HH-RLHF	Estimated Score (EST): 154	12
Best-of-N Alignment	HH-RLHF	BWR: 53	12
Reward model verification	HH-RLHF	Win Rate: 47.3	12
Harmlessness evaluation	HH-RLHF harmless (test)	Win Rate: 83.33	12
Confidence Estimation	HH-RLHF	Rank Correlation (RK): 0.4718	11
Helpful Assistant	HH-RLHF	HV Score: 9.08	10
RLHF	HH-RLHF (held-out)	Peak Gold Reward: 1.59	9
Certified Poisoning Stability	HH-RLHF	FTS@1: 100	9

Showing 25 of 47 rows