Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models

About

Although large language models (LLMs) are typically safety-aligned at the time of release, they still face various safety challenges. A key issue is that fine-tuning often compromises this safety alignment. To address this issue, we propose IRR (Identify, Remove, and Recalibrate for Safety Realignment), a post-hoc method that performs safety realignment for fine-tuned LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned model, while recalibrating the retained ones. We evaluate IRR across various datasets, under both full fine-tuning and LoRA. Our results demonstrate that IRR significantly enhances the safety of fine-tuned models on safety benchmarks, covering harmful queries and jailbreak attacks, while maintaining their performance on downstream tasks. The source code is available at: https://anonymous.4open.science/r/IRR-BD4F.
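The identify/remove/recalibrate idea can be illustrated with a toy sketch on flat parameter lists. It assumes that "delta parameters" means the element-wise difference between fine-tuned and base weights, and that an unsafe-parameter mask is already available. The function names, the mask, and the rescaling rule are hypothetical placeholders; the abstract does not specify IRR's actual identification criterion or recalibration procedure, so this is not the authors' implementation.

```python
from typing import List


def delta_parameters(base: List[float], finetuned: List[float]) -> List[float]:
    """Element-wise difference between fine-tuned and base weights."""
    return [f - b for b, f in zip(base, finetuned)]


def realign(base: List[float], finetuned: List[float],
            unsafe_mask: List[bool]) -> List[float]:
    """Toy identify/remove/recalibrate pass (illustrative assumption only).

    unsafe_mask[i] is True where a delta parameter has been flagged as
    unsafe. How IRR computes this flag is not described in the abstract;
    here the mask is simply supplied by the caller.
    """
    delta = delta_parameters(base, finetuned)

    # Remove: zero out the delta parameters flagged as unsafe.
    kept = [0.0 if unsafe else d for d, unsafe in zip(delta, unsafe_mask)]

    # Recalibrate (placeholder rule, NOT the paper's): rescale the retained
    # deltas so their total absolute magnitude matches the original delta.
    total = sum(abs(d) for d in delta)
    kept_total = sum(abs(d) for d in kept)
    scale = total / kept_total if kept_total > 0 else 0.0

    # Add the recalibrated deltas back onto the base weights.
    return [b + scale * k for b, k in zip(base, kept)]


if __name__ == "__main__":
    base = [1.0, 2.0, 3.0]
    finetuned = [1.5, 1.0, 3.5]
    mask = [False, True, False]  # hypothetical: second delta flagged unsafe
    print(realign(base, finetuned, mask))
```

In this toy setting, a flagged parameter reverts to its base value, while the retained deltas are scaled up to compensate for the removed magnitude. A real method would derive both the mask and the compensation from model and data statistics rather than from a fixed rule.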

Di Wu, Xin Lu, Yanyan Zhao, Bing Qin • 2024

Related benchmarks

Task | Dataset | Result | Rank
Code Generation | HumanEval | Pass@1: 19.02 | 850
Mathematical Reasoning | GSM8K (test) | Accuracy: 42.91 | 797
Safety Evaluation | Harmful Benchmarks (CATQA, HEX-PHI, Salad-Base) | CATQA Score: 99.7 | 24
Jailbreak Defense | Jailbreak Attack Benchmarks (GPTFuzz, TAP, GCG, AutoDAN, Template) | GPTFuzz ASR: 59.56 | 24
Chinese Language Understanding | MMMLU | MMMLU Score: 37.08 | 8
