GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning
About
Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.
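The data-mixing idea described above can be illustrated with a minimal sketch. This is not the released GR-SAP code: the function name `mix_replay`, the `replay_ratio` parameter, and the sampling scheme are assumptions made for illustration only. It shows one plausible way to blend synthesized alignment examples into a downstream fine-tuning set so that a single training run sees both objectives.

```python
import random


def mix_replay(task_examples, replay_examples, replay_ratio=0.1, seed=0):
    """Blend replayed alignment examples into a task dataset.

    Hypothetical helper, not part of the GR-SAP repository.
    replay_ratio is the number of replay examples added per task example
    (an assumed knob; the paper's actual mixing scheme may differ).
    """
    if replay_ratio < 0:
        raise ValueError("replay_ratio must be non-negative")

    rng = random.Random(seed)
    task_examples = list(task_examples)
    replay_examples = list(replay_examples)

    # Cap the request at the number of replay examples actually available.
    n_replay = min(len(replay_examples), round(replay_ratio * len(task_examples)))
    sampled = rng.sample(replay_examples, n_replay)

    # Shuffle so replay examples are spread across training batches.
    mixed = task_examples + sampled
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for downstream task data and synthesized alignment data.
task = [{"source": "task", "id": i} for i in range(20)]
replay = [{"source": "replay", "id": i} for i in range(5)]
mixed = mix_replay(task, replay, replay_ratio=0.1, seed=0)
```

With 20 task examples and `replay_ratio=0.1`, the mixed set contains all 20 task examples plus 2 sampled replay examples. In practice the replay pool would be generated by the aligned model itself rather than written by hand.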
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 77.84 | 337 |
| Mathematical Reasoning | MATH | Accuracy | 46.87 | 44 |
| Commonsense Reasoning | HellaSwag | HS Score | 14.11 | 43 |
| Commonsense Reasoning | WinoGrande | Accuracy | 81.64 | 23 |
| Question Answering | MedQA | HS | 13.05 | 17 |