PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
About
In this study, we introduce PKU-SafeRLHF, a human preference dataset on safety designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate the annotation of helpfulness from that of harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels covering 19 harm categories and three severity levels (ranging from minor to severe), with answers generated by Llama-family models. Building on this, we collected 166.8k preference annotations, comprising dual-preference data (helpfulness and harmlessness annotated separately) and single-preference data (helpfulness and harmlessness traded off in a single judgment). Using this large-scale annotated data, we further train a severity-sensitive moderation model for risk control of LLMs and apply safety-centric RLHF algorithms for their safety alignment. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs. Data is available at https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF.
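A minimal sketch of how one might work with the safety meta-labels described above. The field names used here (`prompt`, `response`, `is_safe`, `severity_level`, `harm_categories`) are illustrative assumptions for this sketch, not the dataset's confirmed schema; consult the Hugging Face dataset card for the actual column names.

```python
# Sketch: filtering QA pairs by safety meta-labels. The field names below
# (prompt, response, is_safe, severity_level, harm_categories) are
# illustrative assumptions, not the dataset's confirmed schema.

# In practice one might load the data with the `datasets` library:
#   from datasets import load_dataset
#   ds = load_dataset("PKU-Alignment/PKU-SafeRLHF")

# Mock records standing in for annotated question-answer pairs.
records = [
    {"prompt": "How do I pick a lock?", "response": "...",
     "is_safe": False, "severity_level": "minor",
     "harm_categories": ["Physical Harm"]},
    {"prompt": "How do I bake bread?", "response": "...",
     "is_safe": True, "severity_level": None,
     "harm_categories": []},
]

def unsafe_by_severity(rows, level):
    """Return unsafe QA pairs annotated with the given severity level."""
    return [r for r in rows
            if not r["is_safe"] and r["severity_level"] == level]

minor = unsafe_by_severity(records, "minor")
print(len(minor))  # count of unsafe pairs labeled 'minor'
```

The same pattern extends to the dual-preference split, where each record pairs two responses and carries separate helpfulness and harmlessness preference labels.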
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Evaluation | AdvBench | -- | -- | 117 |
| Safety Evaluation | StrongREJECT | Attack Success Rate | 15 | 45 |
| Red-teaming Safety Evaluation | StrongREJECT | ASR | 19 | 32 |
| Red-teaming Safety Evaluation | HarmBench | ASR | 4 | 32 |
| Red-teaming Safety Evaluation | Edgebench | HS Score | 3.59 | 16 |
| Red-teaming Safety Evaluation | SC-Safety | HS | 2.44 | 16 |
| Red-teaming Safety Evaluation | Basebench | HS | 2.11 | 16 |
| Red-teaming Safety Evaluation | XSTest | HPR | 39 | 8 |
| Red-teaming Safety Evaluation | AdvBench | HPR | 34 | 8 |
| Safety Evaluation | XSTest | HS Rate | 2.59 | 8 |