STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
About
This paper introduces STAR-1, a high-quality safety dataset of just 1K examples, specifically designed for large reasoning models (LRMs) such as DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical need for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. We then curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select the training examples best aligned with these principles. Experimental results show that fine-tuning LRMs with STAR-1 yields an average 40% improvement in safety performance across four benchmarks, while incurring only a marginal decrease (an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.
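The three-step curation pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `score_sample` stub, the 0-10 scoring scale, and the acceptance threshold are all assumptions (the paper uses a GPT-4o-based judge for scoring).

```python
# Hedged sketch of a STAR-1-style curation loop: generate policy-grounded
# reasoning samples, score each with a safety judge, and keep high scorers.
# The judge below is a placeholder heuristic, NOT the paper's GPT-4o scorer.

def score_sample(sample: dict) -> float:
    """Stub safety judge returning a 0-10 score (assumed scale).

    A real implementation would prompt GPT-4o with the curated safety
    policy and a scoring rubric, then parse the returned score.
    """
    # Illustrative stand-in: reward reasoning that cites a safety policy.
    return 10.0 if "policy" in sample["reasoning"].lower() else 3.0

def curate(candidates: list[dict], threshold: float = 8.0) -> list[dict]:
    """Keep only samples whose safety score clears the threshold."""
    return [s for s in candidates if score_sample(s) >= threshold]

# Toy candidates: one deliberative, policy-grounded refusal and one
# non-compliant completion that should be filtered out.
candidates = [
    {"prompt": "harmful request", "reasoning": "Per policy S1, refuse because ..."},
    {"prompt": "harmful request", "reasoning": "Sure, here is how ..."},
]
kept = curate(candidates)
```

In this sketch, only the policy-grounded sample survives filtering; the actual pipeline additionally enforces diversity across the source datasets before scoring.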
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy: 92 | 535 |
| Mathematical Reasoning | AIME | AIME Accuracy: 83.3 | 283 |
| Science Question Answering | ARC Challenge | -- | 234 |
| Science Reasoning | GPQA | Accuracy: 58.6 | 218 |
| Mathematical Reasoning | MATH 500 | pass@1: 90.58 | 153 |
| Reasoning | GPQA Diamond | -- | 88 |
| Mathematical Reasoning | AIME 2024 | Pass@1: 45.83 | 54 |
| Safety Evaluation | StrongREJECT | Attack Success Rate: 3.51 | 45 |
| Harmful Request Defense | AdvBench | ASR: 0.00 | 44 |
| Over-refusal | XSTest | -- | 42 |