PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
About
We introduce PKU-SafeRLHF, a human preference dataset designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, it annotates helpfulness and harmlessness separately for question-answer pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels covering 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on these, we collected 166.8k preference data points, comprising dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off from scratch). Using this large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding the safe deployment of LLMs. The data is available at https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF.
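To make the dual-preference idea concrete, the following minimal sketch shows how one decoupled annotation could yield two separate (chosen, rejected) pairs, one per dimension. The `DualPreferenceRecord` class, its field names, and the toy record are hypothetical and chosen for illustration only; they are not the schema of the released dataset.

```python
from dataclasses import dataclass


@dataclass
class DualPreferenceRecord:
    # Hypothetical schema for illustration; the released dataset may differ.
    prompt: str
    responses: tuple  # two candidate answers to the same prompt
    more_helpful: int  # index (0 or 1) of the answer judged more helpful
    safer: int  # index (0 or 1) of the answer judged more harmless


def to_pairs(rec):
    """Return one (chosen, rejected) pair per preference dimension."""
    h, s = rec.more_helpful, rec.safer
    return {
        "helpfulness": (rec.responses[h], rec.responses[1 - h]),
        "harmlessness": (rec.responses[s], rec.responses[1 - s]),
    }


# Toy record (made up): the more helpful answer is not the safer one.
record = DualPreferenceRecord(
    prompt="example prompt",
    responses=("answer A", "answer B"),
    more_helpful=0,
    safer=1,
)
pairs = to_pairs(record)
print(pairs["helpfulness"])
print(pairs["harmlessness"])
```

Because the two labels are stored independently, the same question-answer pair can be ranked differently for helpfulness and for harmlessness, which is the case a single trade-off label cannot represent.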
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | AlpacaEval | Win Rate | 50.87 | 420 |
| Mathematical Reasoning | GSM8K | Accuracy (Acc) | 77 | 337 |
| Safety Evaluation | HarmBench | -- | -- | 148 |
| Safety Evaluation | AdvBench | -- | -- | 117 |
| Truthfulness Evaluation | TruthfulQA | Accuracy | 71 | 108 |
| Safety Evaluation | StrongREJECT | Attack Success Rate | 15 | 77 |
| Red-teaming Safety Evaluation | StrongREJECT | ASR | 19 | 53 |
| Safety Evaluation | CategoricalHarmfulQA Alpaca fine-tuning (test) | ASR Delta (S1-S5) | 3.05 | 42 |
| Safety Evaluation | AdvBench Safety Evaluation | ASR (S1) | 35 | 42 |
| Red-teaming Safety Evaluation | HarmBench | ASR | 4 | 32 |