PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
About
In this study, we introduce PKU-SafeRLHF, a human preference dataset for safety, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we annotate helpfulness and harmlessness separately for question-answer pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels covering 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data points, including dual-preference data (helpfulness and harmlessness annotated separately) and single-preference data (helpfulness and harmlessness traded off from scratch in a single judgment). Using this large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs. Data is available at https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF.
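To make the dual-preference idea concrete, the following is a minimal sketch of how a single decoupled annotation could be turned into separate chosen/rejected pairs for helpfulness and harmlessness. The `PreferenceRecord` class, its field names, and the severity encoding (0 for safe, 1 to 3 for minor to severe) are assumptions made for illustration; they are not the dataset's actual schema, which should be checked on the dataset page.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class PreferenceRecord:
    """Hypothetical record layout (illustrative only, not the real schema)."""
    prompt: str
    responses: Tuple[str, str]
    better_id: int              # index of the more helpful response
    safer_id: int               # index of the more harmless response
    severity: Tuple[int, int]   # assumed encoding: 0 = safe, 1..3 = minor..severe


def split_preferences(rec: PreferenceRecord) -> Dict[str, Tuple[str, str]]:
    """Return a (chosen, rejected) pair for each decoupled dimension."""
    def pair(winner: int) -> Tuple[str, str]:
        return rec.responses[winner], rec.responses[1 - winner]

    return {
        "helpfulness": pair(rec.better_id),
        "harmlessness": pair(rec.safer_id),
    }


def preferences_conflict(rec: PreferenceRecord) -> bool:
    """True when the more helpful response is not the safer one."""
    return rec.better_id != rec.safer_id


# Made-up example record, not taken from the dataset.
example = PreferenceRecord(
    prompt="How do I pick a lock?",
    responses=("Step-by-step instructions ...", "I can't help with that."),
    better_id=0,
    safer_id=1,
    severity=(1, 0),
)
pairs = split_preferences(example)
```

In this sketch, records where `preferences_conflict` returns `True` are the ones where decoupled annotation carries information that a single trade-off label would hide.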
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Safety Evaluation | AdvBench | -- | 117 |
| Safety Evaluation | StrongREJECT | Attack Success Rate: 15 | 65 |
| Red-teaming Safety Evaluation | StrongREJECT | ASR: 19 | 32 |
| Red-teaming Safety Evaluation | HarmBench | ASR: 4 | 32 |
| Response Moderation | Public Benchmarks for Response Moderation (SafeRLHF, WildGuard, HarmBench, BeaverTails, XSTest, Aegis 2.0) | SafeRLHF Score: 90.74 | 30 |
| Mathematical Reasoning | GSM8K | Retention: 73.39 | 28 |
| Mathematical Reasoning | MathQA | Retention: 25.03 | 28 |
| Question Answering | Medical Multiple Choice (MedQA, PubMedQA, MedMCQA, HeadQA) | Average Accuracy: 50.09 | 28 |
| Harmful Question Forgetting | Harm-1 GPTFUZZER WildAttack | ASR: 3 | 28 |
| Mathematical Reasoning | MATH | Retention: 22.14 | 28 |