PKU-SafeRLHF

Benchmarks

Task Name	Dataset Name	SOTA Result
Reward Modeling	PKU-SafeRLHF (test)	MAE 0.074	36
Safety Alignment	PKU-SafeRLHF 30K (IID)	WR 89.26	36
Safety Alignment Evaluation	PKU-SafeRLHF 30K (test)	Win Rate (WR) 90.23	32
Human Preference Alignment	PKU-SafeRLHF	BLEU 0.324	31
Safety Evaluation	PKU-SafeRLHF-V	Accuracy 77.8	20
Helpful and Harmless Response Generation	PKU-SafeRLHF alpaca2-7b (test)	Helpfulness 8.53	14
Safety alignment	PKU-SafeRLHF	Gold Reward 3.92	14
LLM Safety Alignment	PKU-SafeRLHF full deduplicated (test)	Helpfulness 8.22	12
LLM Alignment	PKU-SafeRLHF	BWR (Median) 49	12
Best-of-N Alignment	PKU-SafeRLHF	Percent batches with BWR > 0.50 38	12
Combined win rate evaluation	PKU-SafeRLHF prompts n = 100 samples (Sev-Low)	CVaR(0.125) Combined Win Rate 53.6	10
Combined win rate evaluation	PKU-SafeRLHF Sev-3 prompts n = 100 samples	Combined Win Rate (CVaR 0.125) 60.9	10
Combined win rate evaluation	PKU-SafeRLHF Conflict prompts n = 100 samples	CVaR(0.125) Combined Win Rate 40.3	10
Combined win rate evaluation	PKU-SafeRLHF Random prompts n = 100 samples	CVaR(0.125) Combined Win Rate 37.1	10
Safety Alignment Robustness Evaluation	PKU-SafeRLHF (n=100 samples)	Random Rate 18.5	10
Safety Alignment	PKU-SafeRLHF in-distribution (test)	Accuracy (EN) 99.44	10
Harmfulness Evaluation	PKU-SafeRLHF	Beaver-7B-Cost Score -1.11	10
Privacy Violation Detection	PKU-SafeRLHF	Acc 87.5	9
Reward Model Transfer	PKU-SafeRLHF	AOG 2.46	8
Preference Evaluation	PKU-SafeRLHF	Win Rate 57	8
Safe RLHF Alignment	PKU-SafeRLHF 30K	Helpfulness 6.51	7
Safety and Informativeness Evaluation	PKU-SafeRLHF (test)	Drugs & Weapons Safety Score 85.3	6
Helpfulness	PKU-SafeRLHF 30K	Win Rate 84.5	6
Harmlessness	PKU-SafeRLHF-30K	Win Rate 87.25	6
Safety Detection	PKU-SafeRLHF (held-out)	AUROC 90.6	5

Showing 25 of 44 rows