PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

About

In this study, we introduce PKU-SafeRLHF, a human preference dataset designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we annotate helpfulness and harmlessness separately for question-answer pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data points, including dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off from scratch). Using the large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs. Data is available at https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF.
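To make the dual-preference idea concrete, here is a minimal sketch of how one annotated comparison could be split into two independent preference pairs, one per axis. The class name, field names, and severity encoding are hypothetical choices for illustration and are not taken from the released dataset schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DualPreferenceRecord:
    """Hypothetical layout for one comparison (field names are assumptions)."""
    prompt: str
    responses: Tuple[str, str]
    more_helpful: int          # index (0 or 1) of the response judged more helpful
    safer: int                 # index (0 or 1) of the response judged more harmless
    severity: Tuple[int, int]  # assumed encoding: 0 = safe, 1..3 = minor..severe


def to_pairs(rec: DualPreferenceRecord) -> Dict[str, Tuple[str, str]]:
    """Return one (chosen, rejected) pair per preference axis."""
    def pair(winner: int) -> Tuple[str, str]:
        return rec.responses[winner], rec.responses[1 - winner]

    return {
        "helpfulness": pair(rec.more_helpful),
        "harmlessness": pair(rec.safer),
    }


# Illustrative record in which the two axes disagree.
rec = DualPreferenceRecord(
    prompt="example prompt",
    responses=("detailed but risky answer", "cautious refusal"),
    more_helpful=0,
    safer=1,
    severity=(2, 0),
)
pairs = to_pairs(rec)
```

In this example the same two responses yield opposite orderings on the two axes, which is the case a single trade-off label cannot represent.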

Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Juntao Dai, Boren Zheng, Tianyi Qiu, Jiayi Zhou, Kaile Wang, Boxuan Li, Sirui Han, Yike Guo, Yaodong Yang • 2024

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval	Win Rate 50.87	420
Mathematical Reasoning	GSM8K	Accuracy (Acc) 77	337
Safety Evaluation	HarmBench	--	148
Safety Evaluation	AdvBench	--	117
Truthfulness Evaluation	TruthfulQA	Accuracy 71	108
Safety Evaluation	StrongREJECT	Attack Success Rate 15	77
Red-teaming Safety Evaluation	StrongREJECT	ASR 19	53
Safety Evaluation	CategoricalHarmfulQA Alpaca fine-tuning (test)	ASR Delta (S1-S5) 3.05	42
Safety Evaluation	AdvBench Safety Evaluation	ASR (S1) 35	42
Red-teaming Safety Evaluation	HarmBench	ASR 4	32

Showing 10 of 36 rows
