
GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

About

Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen • 2026
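
The data-mixing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it only shows the general pattern of blending replayed safety examples into a fine-tuning set. The function name `mix_with_replay`, the `replay_ratio` parameter, and the example strings are all hypothetical, and the synthesis step that would produce the replay examples is assumed to have happened already.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_with_replay(
    task_data: Sequence[T],
    replay_data: Sequence[T],
    replay_ratio: float,
    seed: int = 0,
) -> List[T]:
    """Blend task examples with a sample of replayed safety examples.

    replay_ratio is a hypothetical knob: the number of replay examples
    to add per task example, capped by how many replay examples exist.
    """
    if replay_ratio < 0:
        raise ValueError("replay_ratio must be non-negative")
    rng = random.Random(seed)
    # Number of replay examples to draw, never more than are available.
    n_replay = min(len(replay_data), round(replay_ratio * len(task_data)))
    sampled = rng.sample(list(replay_data), n_replay)
    # Concatenate and shuffle so both kinds of example are interleaved.
    mixed = list(task_data) + sampled
    rng.shuffle(mixed)
    return mixed


# Placeholder strings standing in for real training examples.
task_examples = ["task-1", "task-2", "task-3", "task-4"]
safety_examples = ["safe-1", "safe-2", "safe-3"]
mixed_examples = mix_with_replay(task_examples, safety_examples, replay_ratio=0.5)
```

With four task examples and a ratio of 0.5, the mixed set holds all four task examples plus two sampled safety examples. A fixed seed keeps the mix reproducible across runs.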

Related benchmarks

Task                     Dataset      Metric           Result   Rank
Mathematical Reasoning   GSM8K        Accuracy (Acc)   77.84    337
Mathematical Reasoning   MATH         Accuracy         46.87    44
Commonsense Reasoning    HellaSwag    HS Score         14.11    43
Commonsense Reasoning    WinoGrande   Accuracy         81.64    23
Question Answering       MedQA        HS               13.05    17