
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models

About

While large language models (LLMs) such as Llama-2 or GPT-4 have shown impressive zero-shot performance, fine-tuning is still necessary to enhance their performance on customized datasets, domain-specific tasks, or other private needs. However, fine-tuning all parameters of LLMs requires significant hardware resources, which can be impractical for typical users. Therefore, parameter-efficient fine-tuning methods such as LoRA have emerged, allowing users to fine-tune LLMs without considerable computing resources and with little performance degradation compared to full-parameter fine-tuning. Unfortunately, recent studies indicate that fine-tuning can increase the safety risks of LLMs, even when the data does not contain malicious content. To address this challenge, we propose Safe LoRA, a simple one-liner patch to the original LoRA implementation that projects the LoRA weights of selected layers onto the safety-aligned subspace, effectively reducing the safety risks of LLM fine-tuning while maintaining utility. Notably, Safe LoRA is a training-free and data-free approach, as it only requires knowledge of the weights of the base and aligned LLMs. Our extensive experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains safety performance similar to that of the original aligned model. Moreover, when the fine-tuning dataset contains a mixture of benign and malicious data, Safe LoRA mitigates the negative effect caused by the malicious data while preserving performance on downstream tasks. Our code is available at https://github.com/IBM/SafeLoRA.
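To make the projection step concrete, below is a minimal PyTorch sketch of how such a patch could look. The helper names (`safety_projection`, `project_lora_update`), the Frobenius normalization, and the cosine-similarity threshold used for layer selection are illustrative assumptions, not the authors' exact construction; see the linked repository for the official implementation.

```python
import torch

def safety_projection(W_aligned: torch.Tensor, W_base: torch.Tensor) -> torch.Tensor:
    """Projection matrix for one layer, built from the alignment direction
    V = W_aligned - W_base (aligned minus unaligned base weights)."""
    V = W_aligned - W_base                     # (d_out, d_in)
    C = V @ V.T                                # spans the safety-aligned subspace
    return C / torch.linalg.matrix_norm(C)     # Frobenius normalization (assumed)

def project_lora_update(A: torch.Tensor, B: torch.Tensor, C: torch.Tensor,
                        threshold: float = 0.35):
    """Project one layer's LoRA update delta_W = B @ A onto the safety-aligned
    subspace when it deviates too much from that subspace."""
    delta_W = B @ A                            # (d_out, d_in) low-rank update
    projected = C @ delta_W
    # Layer selection rule (assumed): patch only layers whose update is
    # dissimilar from its projection; the threshold value is an assumption.
    sim = torch.nn.functional.cosine_similarity(
        delta_W.flatten(), projected.flatten(), dim=0)
    if sim < threshold:
        B = C @ B                              # the "one-liner" patch: fold C into B
    return A, B
```

In this sketch, only the base and aligned model weights are needed to build the projection, which is why the approach requires no extra training or data.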

Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang• 2024

Related benchmarks

Task | Dataset | Result | Rank
Code Generation | HumanEval | Pass@1: 11.89 | 850
Mathematical Reasoning | GSM8K (test) | Accuracy: 22.61 | 797
Safety Evaluation | Harmful Benchmarks (CATQA, HEX-PHI, Salad-Base) | CATQA Score: 99.94 | 24
Jailbreak Defense | Jailbreak Attack Benchmarks (GPTFuzz, TAP, GCG, AutoDAN, Template) | GPTFuzz ASR: 74.73 | 24
Sentiment Analysis | SST2 | Attack Success Rate (ASR): 72.4 | 17
Chinese Language Understanding | MMMLU | MMMLU Score: 22.61 | 8
Code Generation | Code | ASR: 35.5 | 7
Mathematical Reasoning | GSM8K | ASR: 24.4 | 7
