SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

About

Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but they often introduce custom algorithms that are difficult to implement or that compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that restores safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned layers with their safety-aligned counterparts only when they deviate from safe behavior, as measured by a cosine similarity criterion. Across four LLMs and several tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective, layer-wise merging offers a robust safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple yet effective post-fine-tuning defense.
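
The selective, layer-wise idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify what the cosine similarity is computed between, how the threshold is chosen, or how the merge is weighted. The sketch assumes, purely for illustration, that each layer is a flat list of weights, that similarity is measured directly between the fine-tuned and safety-aligned layer, and that deviating layers are linearly interpolated. The names `selective_merge`, `tau`, and `alpha` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as maximally dissimilar
    return dot / (norm_a * norm_b)


def selective_merge(fine_tuned, safe, tau=0.9, alpha=0.5):
    """Merge only the layers whose similarity to the safe model is below tau.

    fine_tuned, safe: dicts mapping layer name -> flat list of weights
                      (an assumed, simplified representation).
    tau:   hypothetical similarity threshold.
    alpha: hypothetical weight on the fine-tuned layer in the linear merge.
    Returns the merged layers and the names of the layers that were merged.
    """
    merged = {}
    merged_layers = []
    for name, ft_w in fine_tuned.items():
        safe_w = safe[name]
        if cosine_similarity(ft_w, safe_w) < tau:
            # Layer deviates: interpolate toward the safety-aligned weights.
            merged[name] = [alpha * f + (1 - alpha) * s
                            for f, s in zip(ft_w, safe_w)]
            merged_layers.append(name)
        else:
            # Layer is close enough: keep the fine-tuned weights unchanged.
            merged[name] = list(ft_w)
    return merged, merged_layers


# Toy example: layer0 stays close to the safe model, layer1 deviates.
fine_tuned = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
safe = {"layer0": [1.0, 0.1], "layer1": [1.0, 0.0]}
merged, changed = selective_merge(fine_tuned, safe, tau=0.9, alpha=0.5)
print(changed)
```

In this toy setup only `layer1` falls below the threshold, so it alone is interpolated while `layer0` keeps its fine-tuned weights. This is the sense in which a selective merge leaves most of the task-specific adaptation untouched.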

Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche • 2025

Related benchmarks

Task	Dataset	Result	
Instruction Following	IFEval	--	854
Multitask Language Understanding	MMLU	Accuracy 68.9	263
Question Answering	BoolQ	Accuracy 89.1	233
Safety Evaluation	HexPhi	Harmfulness 3.8	140
Safety Evaluation	DirectHarm	Harmfulness Score 5.9	84
Natural Language Inference	MNLI	--	80
Medical Question Answering	PubMedQA	Accuracy 80.3	65
Safety Evaluation	HEX-PHI (test)	Harmfulness Score (Llama-Guard-3B) 4.3	56
Harmfulness Evaluation	DirectHarm (test)	Harmfulness Score (Llama-Guard-3B) 7.5	56
Harmfulness Evaluation	DirectHarm	Harmfulness Score 7.5	56

Showing 10 of 21 rows
