Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

About

Fine-tuning large language models (LLMs) for downstream tasks often leads to catastrophic forgetting, notably degrading the safety of originally aligned models. While some existing methods attempt to restore safety by incorporating additional safety data, the quality of such data typically falls short of that used in the original alignment process. Moreover, these high-quality safety datasets are generally inaccessible, making it difficult to fully recover the model's original safety. We ask: How can we preserve safety while improving downstream task performance without additional safety data? We show that simply merging the weights of pre- and post-fine-tuned models effectively mitigates safety degradation while enhancing performance. Experiments across different downstream tasks and models validate the method's practicality and effectiveness.
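The abstract does not say which merging scheme is used, so the following is only a minimal sketch that assumes plain linear interpolation between the two sets of weights. The function name `merge_weights`, the coefficient `lam`, and the toy parameter dictionaries are hypothetical and stand in for real model state dictionaries.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, tuned: Weights, lam: float = 0.5) -> Weights:
    """Interpolate per parameter: (1 - lam) * base + lam * tuned.

    Assumption: linear interpolation; the source only says the pre- and
    post-fine-tuned weights are merged. lam = 0 returns the aligned
    (pre-tuning) model, lam = 1 returns the fine-tuned model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("models must share the same parameter names")

    merged: Weights = {}
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(base_vals) != len(tuned_vals):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - lam) * b + lam * t for b, t in zip(base_vals, tuned_vals)
        ]
    return merged


# Hypothetical two-parameter "models" for illustration only.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
tuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = merge_weights(base, tuned, lam=0.5)
```

In this sketch, a smaller `lam` keeps the merged model closer to the originally aligned weights, and a larger `lam` keeps it closer to the fine-tuned weights. How that trade-off is actually chosen is not stated in the abstract.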

Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| Instruction Following | IFEval | -- | 836 |
| Multitask Language Understanding | MMLU | Accuracy: 68.4 | 263 |
| Medical Visual Question Answering | VQA-RAD | Accuracy: 62.53 | 228 |
| Question Answering | PubMedQA | Accuracy: 77.2 | 145 |
| Safety Evaluation | HexPhi | Harmfulness: 4.5 | 140 |
| Safety Evaluation | DirectHarm | Harmfulness Score: 6.4 | 84 |
| Medical Question Answering | PubMedQA | Accuracy: 78.5 | 65 |
| Safety Evaluation | HEX-PHI (test) | Harmfulness Score (Llama-Guard-3B): 5.3 | 56 |
| Harmfulness Evaluation | DirectHarm (test) | Harmfulness Score (Llama-Guard-3B): 8.1 | 56 |
| Harmfulness Evaluation | DirectHarm | Harmfulness Score: 8.1 | 56 |
Showing 10 of 34 rows
